The user copies and pastes the content of an HTML page, and I index that
content. I take the entire document, images included. I can't change this
behavior.
I set max_gram=20. It's better, but at the end I still see this warning many times:
[2014-06-20 11:42:14,201][WARN ][monitor.jvm ] [ik-test2]
[gc][young][528][263] duration [2s], collections [1]/[2.1s], total
[2s]/[43.9s], memory [536mb]->[580.2mb]/[1015.6mb], all_pools {[young]
[22.5mb]->[22.3mb]/[66.5mb]}{[survivor] [14.9kb]->[49.3kb]/[8.3mb]}{[old]
[513.4mb]->[557.8mb]/[940.8mb]}
I set ES_HEAP_SIZE to 2G. I think that's enough.
Is something wrong?
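
For reference, a rough sketch of why lowering max_gram helps but may not be enough on its own. It counts how many n-grams a single token yields with min_gram=3, comparing max_gram=250 against max_gram=20. The helper name ngramCount and the 250-character token are assumptions made for illustration, not anything Elasticsearch exposes. The formula assumes the nGram filter emits every substring of each allowed length.

```php
<?php
// Hypothetical helper (not part of Elasticsearch): counts the n-grams that a
// single token of $length characters yields for sizes $minGram..$maxGram.
function ngramCount(int $length, int $minGram, int $maxGram): int
{
    $count = 0;
    for ($n = $minGram; $n <= min($maxGram, $length); $n++) {
        // A token of $length characters contains ($length - $n + 1)
        // substrings of size $n.
        $count += $length - $n + 1;
    }
    return $count;
}

// Assumed example: one 250-character run of base64 data.
echo ngramCount(250, 3, 250), "\n"; // 30876
echo ngramCount(250, 3, 20), "\n";  // 4311
```

Under these assumptions, even max_gram=20 still yields thousands of tokens for one long base64 run, which fits the advice to keep the image data out of the analyzed field.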
On Friday, June 20, 2014 at 11:45:22 UTC+2, Cédric Hourcade wrote:
>
> If you are only searching in the text, you should index the images in
> another field, with no analyzer ("index: not_analyzed") or,
> even better, "index: no" (not indexed). If you need to retrieve the
> image data, it's still in the _source.
>
> But to be honest I wouldn't even store this kind of information in ES,
> your index is going to be bigger, merges are going to be slower... I'd
> keep the binary files stored elsewhere.
>
> Cédric Hourcade
> [email protected]
>
>
> On Fri, Jun 20, 2014 at 11:25 AM, Tanguy Bernard
> <[email protected]> wrote:
> > Yes, I am applying "reuters" to my document (composed of text and
> > pictures).
> > My goal is to search the text of the document by any word or
> > part of a word.
> >
> > Yes, the problem is my nGram filter.
> > How do I solve this problem? Decrease the nGram max? Switch to
> > another analyzer that still satisfies my goal?
> >
> > On Friday, June 20, 2014 at 10:58:49 UTC+2, Cédric Hourcade wrote:
> >>
> >> Does it mean you're applying the "reuters" analyzer on your base64
> >> encoded pictures?
> >>
> >> I guess it generates a really huge number of tokens for each entry
> >> because of your nGram filter (with a max at 250).
> >>
> >> Cédric Hourcade
> >> [email protected]
> >>
> >>
> >> On Fri, Jun 20, 2014 at 9:09 AM, Tanguy Bernard
> >> <[email protected]> wrote:
> >> > Information:
> >> > My "note_source" contains pictures (.jpg, .png ...) in base64 and text.
> >> >
> >> > For my mapping I have used:
> >> > "type" => "string"
> >> > "analyzer" => "reuters" (the name of my analyzer)
> >> >
> >> >
> >> > Any ideas?
> >> >
> >> > On Thursday, June 19, 2014 at 17:57:46 UTC+2, Tanguy Bernard wrote:
> >> >>
> >> >> Hello,
> >> >> I have an issue when I index one particular field, "note_source" (SQL
> >> >> longtext).
> >> >> I use the same analyzer for every field (except date_source and
> >> >> id_source), but for "note_source" I get a "warn monitor.jvm".
> >> >> When I remove "note_source", everything is fine. If I don't use an
> >> >> analyzer on "note_source", everything is fine, but if I use my
> >> >> analyzer on "note_source", I get crashes.
> >> >>
> >> >> I think I have enough memory; I have set ES_HEAP_SIZE.
> >> >> Maybe my problem is with accents (ASCII, UTF-8).
> >> >>
> >> >> Can you help me with this?
> >> >>
> >> >>
> >> >>
> >> >> My Setting
> >> >>
> >> >> public function createSetting($pf){
> >> >>     $params = array('index' => $pf, 'body' => array(
> >> >>         'settings' => array(
> >> >>             'number_of_shards' => 5,
> >> >>             'number_of_replicas' => 0,
> >> >>             'analysis' => array(
> >> >>                 'filter' => array(
> >> >>                     'nGram' => array(
> >> >>                         "token_chars" => array(),
> >> >>                         "type" => "nGram",
> >> >>                         "min_gram" => 3,
> >> >>                         "max_gram" => 250
> >> >>                     )
> >> >>                 ),
> >> >>                 'analyzer' => array(
> >> >>                     'reuters' => array(
> >> >>                         'type' => 'custom',
> >> >>                         'tokenizer' => 'standard',
> >> >>                         'filter' => array('lowercase', 'asciifolding', 'nGram')
> >> >>                     )
> >> >>                 )
> >> >>             )
> >> >>         )
> >> >>     ));
> >> >>     $this->elasticsearchClient->indices()->create($params);
> >> >>     return;
> >> >> }
> >> >>
> >> >>
> >> >> My Indexing
> >> >>
> >> >> public function indexTable($pf, $typeElement){
> >> >>     $params = array(
> >> >>         "index" => '_river',
> >> >>         "type" => $typeElement,
> >> >>         "id" => "_meta",
> >> >>         "body" => array(
> >> >>             "type" => "jdbc",
> >> >>             "jdbc" => array(
> >> >>                 "url" => "jdbc:mysql://ip/name",
> >> >>                 "user" => 'root',
> >> >>                 "password" => 'mdp',
> >> >>                 "index" => $pf,
> >> >>                 "type" => $typeElement,
> >> >>                 // The SQL statement must be a quoted string.
> >> >>                 "sql" => "select id_source as _id, id_sous_theme, "
> >> >>                     . "titre_source, desc_source, note_source, adresse_source, "
> >> >>                     . "type_source, date_source from source",
> >> >>                 "max_bulk_requests" => 5,
> >> >>             )
> >> >>         )
> >> >>     );
> >> >>
> >> >>     $this->elasticsearchClient->index($params);
> >> >> }
> >> >>
> >> >> Thanks in advance.
> >> >
> >> > --
> >> > You received this message because you are subscribed to the Google
> >> > Groups "elasticsearch" group.
> >> > To unsubscribe from this group and stop receiving emails from it,
> >> > send an email to [email protected].
> >> > To view this discussion on the web visit
> >> > https://groups.google.com/d/msgid/elasticsearch/5d93217c-bded-40fa-8fd2-fdac576c57ee%40googlegroups.com.
> >> > For more options, visit https://groups.google.com/d/optout.