Re: implementing a plugin to process the whole input document

Jakub Kotowski Fri, 23 May 2014 11:18:12 -0700

Great, the ParseContext looks promising.

We'll try it and report back, thanks!


Jakub

BTW, just to answer your previous implicit question - SIREn allows for 
advanced structured document search, more at http://sirendb.com/
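
For the record, here is a minimal sketch of the idea in the quoted reply below: a custom mapper takes the raw _source bytes, copies them, and hands the whole document to its own field. It deliberately uses no Elasticsearch classes, so the mapper wiring is left out. The class name, the helper names, and the assumption that the source arrives as UTF-8 JSON are all hypothetical and only for illustration.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Standalone sketch: no Elasticsearch classes are used, so registering the
// "siren" type and implementing the mapper's parse method are omitted.
public class SourceCopySketch {

    // Hypothetical name of the extra field; the thread uses _siren only as an example.
    static final String SIREN_FIELD = "_siren";

    // Copy the raw source bytes so later processing cannot alter the original.
    static byte[] copySource(byte[] source) {
        return Arrays.copyOf(source, source.length);
    }

    // Decode the copied bytes as UTF-8 JSON text (assumption: the document
    // was posted as UTF-8 JSON), i.e. the value given to the extra field.
    static String toJsonText(byte[] sourceCopy) {
        return new String(sourceCopy, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // Stand-in for the bytes a mapper would read from the ParseContext.
        byte[] source = "{\"name\":\"john\",\"surname\":\"doe\"}"
                .getBytes(StandardCharsets.UTF_8);
        String value = toJsonText(copySource(source));
        System.out.println(SIREN_FIELD + " <- " + value);
    }
}
```

The point of the sketch is only that the document shape does not matter: whatever JSON was posted is passed on as a whole.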


On Friday, May 23, 2014 6:56:32 PM UTC+1, Jörg Prante wrote:
>
> In answer to (1), in each custom mapper, you have access to ParseContext 
> in the method
>
> public void parse(ParseContext context) throws IOException
>
> In the ParseContext, you can access _source with the source() method to do 
> whatever you want, e.g. copy it, parse it, index it again etc.
>
> (2) is a slight misconception: _source is not a field but a "field 
> container". It is a byte array passed through the ES API so the field 
> mappers can do their work.
>
> (3) as said, it is possible to copy _source, but only internally in the 
> code of a custom field mapper, not by configuration in the mapping, since 
> _source is reserved for special treatment inside ES and users should not be 
> able to tamper with it.
>
> So a customized mapper in a plugin could work like this in the root object:
>
>  "mappings" : {
>       "properties" : {
>            ...
>            "_siren" : { "type" : "siren" }
>       }
> }
>
> and in the corresponding code in the custom mapper, when the field _siren is 
> processed because of the type "siren", it copies the byte array from 
> _source in the ParseContext. (The field name need not be _siren; this 
> is just an example name.)
>
> Jörg
>
>
>
>
> On Fri, May 23, 2014 at 5:38 PM, Jakub Kotowski <[email protected]> wrote:
>
>> Hi Jörg,
>>
>> thanks for the reply. Yes, what you suggest is a way to improve our 
>> current approach so that we can get a subdocument instead of JSON encoded 
>> in a string field.
>>
>> What we would like to achieve is to always be able to process any 
>> document that comes to elasticsearch as a whole, i.e. be it { "title": "my 
>> title", "content" : "my content"} or {"name" : "john", "surname" : "doe"}.
>>
>> For that we either (1) need to be able to set an analyzer for the whole 
>> input document or (2) set an analyzer for the _source field which already 
>> contains the whole doc or (3) copy the _source field to a normal field, 
>> let's say _siren, and set an analyzer for it.
>>
>> (1) and (2) seem to be impossible.
>>
>> So we are exploring option (3) which also seems difficult.
>>
>> Jakub 
>>
>>
>> On Friday, May 23, 2014 4:24:39 PM UTC+1, Jörg Prante wrote:
>>
>>> Not sure what the plugin is doing, but if you want to process dedicated 
>>> JSON data in an ES document, you could prepare an analyzer for a new field 
>>> type. Users can then assign special meaning in the mapping to a field of 
>>> their choice.
>>>
>>> E.g. a mapping with
>>>
>>>     "mappings": {
>>>          "mycontent" : { "type" : "siren" }
>>>     }
>>>
>>> and a given document would look like
>>>
>>>     "mycontent" : {
>>>          "title" : "foo",
>>>          "name" : "bar"
>>>          ...
>>>     }
>>>
>>>
>>> and then you could extract the whole JSON subdoc from the doc under 
>>> "mycontent" into your analyzer plugin and process it. 
>>>
>>> For an example, you could look into plugins like the StandardNumber 
>>> analyzer, where I defined a new type "standardnumber" for analysis:
>>>
>>> https://github.com/jprante/elasticsearch-analysis-standardnumber/blob/master/src/main/java/org/xbib/elasticsearch/index/mapper/standardnumber/StandardNumberMapper.java
>>>
>>> Jörg
>>>
>>>
>>>
>>> On Fri, May 23, 2014 at 4:48 PM, Jakub Kotowski <[email protected]> wrote:
>>>
>>>> Hello all,
>>>>
>>>> we are trying to implement a SIREn plugin for ElasticSearch for 
>>>> indexing and querying documents. We already implemented a version which 
>>>> uses SIREn to index and query a specific field (called "contents" below) 
>>>> which contains a JSON document as a string. An example of a doc:
>>>>
>>>> {
>>>>    "id":3,
>>>>    "contents":"{\"title\":\"This is an another article  about SIREn.\",\"content\":\"bla bla bla \"}"
>>>> }
>>>>  
>>>>
>>>> Instead, we would like to index the whole document as it is posted to 
>>>> ElasticSearch to avoid the need for a special loader that transforms an 
>>>> input JSON to the required form. So then the user would simply post a 
>>>> document such as:
>>>>
>>>> {
>>>>    "id":3,
>>>>    "title":"This is an another article  about SIREn.",
>>>>    "content": "bla bla bla "
>>>> }
>>>>
>>>> and it would be indexed as a whole both by ElasticSearch and by the 
>>>> SIREn plugin.
>>>>
>>>> One problem we encountered is that it is not possible to use copyTo for 
>>>> the _source field and then only configure an analyzer for the copy.
>>>>
>>>>  It seems that the cleanest solution would be to modify the 
>>>> SourceFieldMapper class to allow copyTo. 
>>>>
>>>>  As a workaround we are going to create a class that extends 
>>>> SourceFieldMapper and sets copyTo for the _source field to a new field that 
>>>> will then be used for SIREn, and register it as follows:
>>>>  
>>>> mapperService.documentMapperParser().putRootTypeParser("_source", new 
>>>> ModifiedSourceFieldMapper.TypeParser());
>>>>
>>>> Does it sound OK or is there a simpler/cleaner solution?
>>>>  
>>>> Thank you in advance,
>>>>
>>>> Jakub
>>>>
>>>>
>>>>  -- 
>>>> You received this message because you are subscribed to the Google 
>>>> Groups "elasticsearch" group.
>>>> To unsubscribe from this group and stop receiving emails from it, send 
>>>> an email to [email protected].
>>>>
>>>> To view this discussion on the web visit 
>>>> https://groups.google.com/d/msgid/elasticsearch/352e7668-d382-4ca3-bbeb-605d6c019ed1%40googlegroups.com.
>>>> For more options, visit https://groups.google.com/d/optout.
>>>>
>>>
>
>
