Re: implementing a plugin to process the whole input document

[email protected] Fri, 23 May 2014 10:56:38 -0700

In answer to (1), in each custom mapper, you have access to ParseContext in
the method


public void parse(ParseContext context) throws IOException

Through the ParseContext, you can read _source with the source() method and
do whatever you want with it, e.g. copy it, parse it, or index it again.
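
A minimal sketch of that access pattern, for illustration only. It does not
use the real Elasticsearch classes: SketchParseContext is a stand-in that
models only the source() accessor, the byte[] return type is an assumption,
and SourceCopyingMapper with its method names is hypothetical.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Stand-in for the Elasticsearch ParseContext; NOT the real class.
// Only the source() accessor is modelled, and the byte[] return type
// is an assumption made for this sketch.
interface SketchParseContext {
    byte[] source();
}

// Hypothetical custom mapper that keeps its own copy of the whole document.
final class SourceCopyingMapper {
    private byte[] copiedSource = new byte[0];

    // Same shape as the parse(ParseContext) method quoted above.
    public void parse(SketchParseContext context) throws IOException {
        byte[] source = context.source();
        if (source == null) {
            throw new IOException("no _source available in the parse context");
        }
        // Defensive copy, so later changes to the original buffer do not leak in.
        copiedSource = Arrays.copyOf(source, source.length);
    }

    // The copied document decoded as UTF-8 text, ready for further processing.
    public String copiedSourceAsString() {
        return new String(copiedSource, StandardCharsets.UTF_8);
    }
}

final class SourceCopyDemo {
    public static void main(String[] args) throws IOException {
        byte[] doc = "{\"name\":\"john\",\"surname\":\"doe\"}"
                .getBytes(StandardCharsets.UTF_8);
        SourceCopyingMapper mapper = new SourceCopyingMapper();
        mapper.parse(() -> doc);
        System.out.println(mapper.copiedSourceAsString());
    }
}
```

The copy is taken inside parse() so that the mapper owns its bytes; whether a
copy is actually needed depends on what the plugin does with them afterwards.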

(2) is a slight misconception: _source is not a field but a "field
container". It is a byte array passed through the ES API so that the field
mappers can do their work.

(3) As said, it is possible to copy _source, but only internally in the
code of a custom field mapper, not by configuration in the mapping, since
_source is reserved for special treatment inside ES and users should not be
able to tamper with it.

So a customized mapper in a plugin could work like this in the root object:

    "mappings" : {
        "properties" : {
            ...
            "_siren" : { "type" : "siren" }
        }
    }

In the corresponding custom mapper code, when the field _siren is processed
because of its type "siren", the mapper copies the byte array from _source
in the ParseContext. (The field need not be named _siren; that is just an
example name.)

Jörg




On Fri, May 23, 2014 at 5:38 PM, Jakub Kotowski <[email protected]> wrote:

> Hi Jörg,
>
> thanks for the reply. Yes, what you suggest is a way to improve our
> current approach so that we can get a subdoc instead of a json encoded in a
> string field.
>
> What we would like to achieve is to always be able to process any document
> that comes to elasticsearch as a whole, i.e. be it { "title": "my title",
> "content" : "my content"} or {"name" : "john", "surname" : "doe"}.
>
> For that we either (1) need to be able to set an analyzer for the whole
> input document or (2) set an analyzer for the _source field which already
> contains the whole doc or (3) copy the _source field to a normal field,
> let's say _siren, and set an analyzer for it.
>
> (1) and (2) seem to be impossible.
>
> So we are exploring option (3) which also seems difficult.
>
> Jakub
>
>
> On Friday, May 23, 2014 4:24:39 PM UTC+1, Jörg Prante wrote:
>
>> Not sure what the plugin is doing, but if you want to process dedicated
>> JSON data in an ES document, you could prepare an analyzer for a new field
>> type. That way, users can assign special meaning in the mapping to a field
>> of their preference.
>>
>> E.g. a mapping with
>>
>>     "mappings" : {
>>         "mycontent" : { "type" : "siren" }
>>     }
>>
>> and a given document would look like
>>
>>     "mycontent" : {
>>          "title" : "foo",
>>          "name" : "bar"
>>          ...
>>     }
>>
>>
>> and then you could extract the whole JSON subdoc from the doc under
>> "mycontent" into your analyzer plugin and process it.
>>
>> For an example, you could look into plugins like the StandardNumber
>> analyzer, where I defined a new type "standardnumber" for analysis:
>>
>> https://github.com/jprante/elasticsearch-analysis-standardnumber/blob/master/src/main/java/org/xbib/elasticsearch/index/mapper/standardnumber/StandardNumberMapper.java
>>
>> Jörg
>>
>>
>>
>> On Fri, May 23, 2014 at 4:48 PM, Jakub Kotowski <[email protected]> wrote:
>>
>>> Hello all,
>>>
>>> we are trying to implement a SIREn plugin for ElasticSearch for indexing
>>> and querying documents. We already implemented a version which uses SIREn
>>> to index and query a specific field (called "contents" below) which
>>> contains a JSON document as a string. An example of a doc:
>>>
>>> {
>>>    "id":3,
>>>    "contents":"{\"title\":\"This is an another article about SIREn.\",\"content\":\"bla bla bla \"}"
>>> }
>>>
>>>
>>> Instead, we would like to index the whole document as it is posted to
>>> ElasticSearch to avoid the need for a special loader that transforms an
>>> input JSON to the required form. So then the user would simply post a
>>> document such as:
>>>
>>> {
>>>    "id":3,
>>>    "title":"This is an another article  about SIREn.",
>>>    "content": "bla bla bla "
>>> }
>>>
>>> and it would be indexed as a whole both by ElasticSearch and by the
>>> SIREn plugin.
>>>
>>> One problem we encountered is that it is not possible to use copyTo for
>>> the _source field and then only configure an analyzer for the copy.
>>>
>>>  It seems that the cleanest solution would be to modify the
>>> SourceFieldMapper class to allow copyTo.
>>>
>>>  As a workaround we are going to create a class that extends
>>> SourceFieldMapper and set copyTo for the _source field to a new field that
>>> will be then used for SIREn and register it as follows:
>>>
>>> mapperService.documentMapperParser().putRootTypeParser("_source", new
>>> ModifiedSourceFieldMapper.TypeParser());
>>>
>>> Does it sound OK or is there a simpler/cleaner solution?
>>>
>>> Thank you in advance,
>>>
>>> Jakub
>>>
>>>
>>>  --
>>> You received this message because you are subscribed to the Google
>>> Groups "elasticsearch" group.
>>> To unsubscribe from this group and stop receiving emails from it, send
>>> an email to [email protected].
>>>
>>> To view this discussion on the web visit
>>> https://groups.google.com/d/msgid/elasticsearch/352e7668-d382-4ca3-bbeb-605d6c019ed1%40googlegroups.com.
>>> For more options, visit https://groups.google.com/d/optout.
>>>
>>

