Hi again,

I have a followup question about the ParseContext and processing documents 
before indexing. 

I now need to modify a document before ElasticSearch parses it.

I tried to do it by modifying context.source(), but that leads to a corrupt 
index. I guess that's because context.parser is also initialized with the 
same byte array (at least w.r.t. its contents) as context.source(). So in 
order to mutate the byte array, I would need to do it in the parser too. The 
parser, however, has already been started and, by the time I get to it, has 
already processed at least two tokens. That means I could conceivably 
restart the parser on the modified byte array and advance it to the state 
corresponding to the one ObjectMapper's actions would originally have left 
it in. This would, however, very clearly be a fragile hack... One way of 
avoiding it might be to somehow become the first rootMapper executed by the 
ObjectMapper, but I think the order is hardcoded and cannot easily be 
changed (there's no client API for it afaik).
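
To make my guess above concrete, here is a minimal toy sketch in plain Java. 
It is not Elasticsearch code: ToyContext, its source()/parser() methods and 
StaleParserDemo are hypothetical stand-ins I made up for illustration, and a 
java.io.Reader plays the role of the parser. It only shows the general 
effect I suspect, namely that a parser created over the original bytes keeps 
reading them after the source has been swapped.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

final class StaleParserDemo {

    // Hypothetical stand-in for a parse context; NOT the Elasticsearch class.
    static final class ToyContext {
        private byte[] source;
        private final Reader parser;

        ToyContext(byte[] source) {
            this.source = source;
            // The "parser" is created once, over the original bytes.
            this.parser = new InputStreamReader(
                    new ByteArrayInputStream(source), StandardCharsets.UTF_8);
        }

        byte[] source() { return source; }
        void source(byte[] replacement) { this.source = replacement; }
        Reader parser() { return parser; }
    }

    // Returns what the already-started parser still sees after the swap.
    static String remainingParserInput() {
        try {
            ToyContext ctx = new ToyContext(
                    "{\"a\":1}".getBytes(StandardCharsets.UTF_8));
            // The parser has already advanced before the custom mapper runs
            // (two characters stand in for the two consumed tokens).
            ctx.parser().read();
            ctx.parser().read();
            // Swap in a modified source; the parser is not told about it.
            ctx.source("{\"b\":2}".getBytes(StandardCharsets.UTF_8));
            StringBuilder rest = new StringBuilder();
            int c;
            while ((c = ctx.parser().read()) != -1) {
                rest.append((char) c);
            }
            return rest.toString();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(remainingParserInput());
    }
}
```

In this toy, the remaining parser input still comes from the original 
bytes, while source() now returns the replacement, so the stored source and 
the parsed fields would disagree.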

Is there some way of modifying a document before ElasticSearch gets to 
parse it?

Basically, I need to send ES a document containing some JSON subobjects 
that the custom parser of our plugin understands. It doesn't make much 
sense for ElasticSearch to index them as they are, so ideally we would like 
to transform them a bit first.

Thanks for any pointers.

Jakub



On Friday, May 23, 2014 6:56:32 PM UTC+1, Jörg Prante wrote:
>
> In answer to (1), in each custom mapper, you have access to ParseContext 
> in the method
>
> public void parse(ParseContext context) throws IOException
>
> In the ParseContext, you can access _source with the source() method to do 
> whatever you want, e.g. copy it, parse it, index it again etc.
>
> (2) is a slight misconception, since _source is not a field but a "field 
> container": a byte array passed through the ES API so the field mappers 
> can do their work.
>
> (3) as said, it is possible to copy _source, but only internally in the 
> code of a custom field mapper, not by configuration in the mapping, since 
> _source is reserved for special treatment inside ES and users should not be 
> able to tamper with it.
>
> So a customized mapper in a plugin could work like this in the root object:
>
>  "mappings" : {
>       "properties" : {
>            ...
>            "_siren" : { "type" : "siren" }
>       }
> }
>
> and in the corresponding code in the custom mapper, when field _siren is 
> processed because of the type "siren", it copies the byte array from 
> _source in the ParseContext. (The field need not be named _siren; this is 
> just an example name.)
>
> Jörg
>
>
>
>
> On Fri, May 23, 2014 at 5:38 PM, Jakub Kotowski <[email protected]> wrote:
>
>> Hi Jörg,
>>
>> thanks for the reply. Yes, what you suggest is a way to improve our 
>> current approach so that we can get a subdoc instead of a json encoded in a 
>> string field.
>>
>> What we would like to achieve is to always be able to process any 
>> document that comes to elasticsearch as a whole, i.e. be it { "title": "my 
>> title", "content" : "my content"} or {"name" : "john", "surname" : "doe"}.
>>
>> For that we either (1) need to be able to set an analyzer for the whole 
>> input document or (2) set an analyzer for the _source field which already 
>> contains the whole doc or (3) copy the _source field to a normal field, 
>> let's say _siren, and set an analyzer for it.
>>
>> (1) and (2) seem to be impossible.
>>
>> So we are exploring option (3) which also seems difficult.
>>
>> Jakub 
>>
>>
>> On Friday, May 23, 2014 4:24:39 PM UTC+1, Jörg Prante wrote:
>>
>>> Not sure what the plugin is doing, but if you want to process dedicated 
>>> JSON data in an ES document, you could prepare an analyzer for a new field 
>>> type. Users can then assign special meaning in the mapping to a field of 
>>> their preference.
>>>
>>> E.g.  a mapping with
>>>
>>>      "mappings": {
>>>          "mycontent" : { "type" : "siren" }
>>>      }
>>>
>>> and a given document would look like
>>>
>>>     "mycontent" : {
>>>          "title" : "foo",
>>>          "name" : "bar"
>>>          ...
>>>     }
>>>
>>>
>>> and then you could extract the whole JSON subdoc from the doc under 
>>> "mycontent" into your analyzer plugin and process it. 
>>>
>>> For an example, you could look into plugins like the StandardNumber 
>>> analyzer, where I defined a new type "standardnumber" for analysis:
>>>
>>> https://github.com/jprante/elasticsearch-analysis-standardnumber/blob/master/src/main/java/org/xbib/elasticsearch/index/mapper/standardnumber/StandardNumberMapper.java
>>>
>>> Jörg
>>>
>>>
>>>
>>> On Fri, May 23, 2014 at 4:48 PM, Jakub Kotowski <[email protected]> 
>>> wrote:
>>>
>>>> Hello all,
>>>>
>>>> we are trying to implement a SIREn plugin for ElasticSearch for 
>>>> indexing and querying documents. We already implemented a version which 
>>>> uses SIREn to index and query a specific field (called "contents" below) 
>>>> which contains a JSON document as a string. An example of a doc:
>>>>
>>>> {
>>>>    "id":3,
>>>>    "contents":"{\"title\":\"This is an another article  about SIREn.\",\"content\":\"bla bla bla \"}"
>>>> }
>>>>  
>>>>
>>>> Instead, we would like to index the whole document as it is posted to 
>>>> ElasticSearch to avoid the need for a special loader that transforms an 
>>>> input JSON to the required form. So then the user would simply post a 
>>>> document such as:
>>>>
>>>> {
>>>>    "id":3,
>>>>    "title":"This is an another article  about SIREn.",
>>>>    "content": "bla bla bla "
>>>> }
>>>>
>>>> and it would be indexed as a whole both by ElasticSearch and by the 
>>>> SIREn plugin.
>>>>
>>>> One problem we encountered is that it is not possible to use copyTo for 
>>>> the _source field and then only configure an analyzer for the copy.
>>>>
>>>>  It seems that the cleanest solution would be to modify the 
>>>> SourceFieldMapper class to allow copyTo. 
>>>>
>>>>  As a workaround we are going to create a class that extends 
>>>> SourceFieldMapper and set copyTo for the _source field to a new field that 
>>>> will be then used for SIREn and register it as follows:
>>>>  
>>>> mapperService.documentMapperParser().putRootTypeParser("_source", new 
>>>> ModifiedSourceFieldMapper.TypeParser());
>>>>
>>>> Does it sound OK or is there a simpler/cleaner solution?
>>>>  
>>>> Thank you in advance,
>>>>
>>>> Jakub
>>>>
>>>>
>>>>  -- 
>>>> You received this message because you are subscribed to the Google 
>>>> Groups "elasticsearch" group.
>>>> To unsubscribe from this group and stop receiving emails from it, send 
>>>> an email to [email protected].
>>>>
>>>> To view this discussion on the web visit https://groups.google.com/d/
>>>> msgid/elasticsearch/352e7668-d382-4ca3-bbeb-605d6c019ed1%
>>>> 40googlegroups.com 
>>>> <https://groups.google.com/d/msgid/elasticsearch/352e7668-d382-4ca3-bbeb-605d6c019ed1%40googlegroups.com?utm_medium=email&utm_source=footer>
>>>> .
>>>> For more options, visit https://groups.google.com/d/optout.
>>>>
>>>
>
>

