Re: problem indexing with my analyzer

Tanguy Bernard Fri, 20 Jun 2014 02:56:32 -0700

The user copies and pastes the content of an HTML page, and I index that 
information. I take the entire document, images included. I can't change 
this behavior.


I set max_gram=20. It's better, but at the end I still get this warning many times:

[2014-06-20 11:42:14,201][WARN ][monitor.jvm              ] [ik-test2] 
[gc][young][528][263] duration [2s], collections [1]/[2.1s], total 
[2s]/[43.9s], memory [536mb]->[580.2mb]/[1015.6mb], all_pools {[young] 
[22.5mb]->[22.3mb]/[66.5mb]}{[survivor] [14.9kb]->[49.3kb]/[8.3mb]}{[old] 
[513.4mb]->[557.8mb]/[940.8mb]}

I set ES_HEAP_SIZE to 2G. I think that's enough.
Is something wrong?
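
To get a rough sense of why lowering max_gram helps, here is a minimal sketch that counts the grams an nGram filter would emit for a single token. It assumes the filter emits every substring whose length lies between min_gram and max_gram. The helper countNGrams is hypothetical (it is not part of Elasticsearch or the PHP client), and the 250-character token is only an illustrative stand-in for a long run of base64 data.

```php
<?php
// Hypothetical helper, not an Elasticsearch API: counts how many grams
// an nGram filter would produce for ONE token of the given length,
// assuming every substring of size min_gram..max_gram is emitted.
function countNGrams(int $tokenLength, int $minGram, int $maxGram): int
{
    $total = 0;
    // A gram cannot be longer than the token itself.
    $upper = min($maxGram, $tokenLength);
    for ($size = $minGram; $size <= $upper; $size++) {
        // A token of length n contains (n - size + 1) substrings of that size.
        $total += $tokenLength - $size + 1;
    }
    return $total;
}

echo countNGrams(250, 3, 250), "\n"; // 30876 grams with max_gram = 250
echo countNGrams(250, 3, 20), "\n";  // 4311 grams with max_gram = 20
echo countNGrams(10, 3, 20), "\n";   // 36 grams for an ordinary 10-letter word
```

Under these assumptions, max_gram=20 cuts the grams for one long token by roughly a factor of seven, but each long base64 token still yields thousands of grams, while a normal word yields only a few dozen.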

On Friday, 20 June 2014 at 11:45:22 UTC+2, Cédric Hourcade wrote:
>
> If you are only searching in the text, you should index the images in 
> another field, with no analyzer ("index: not_analyzed") or, even 
> better, "index: no" (not indexed). If you need to retrieve the image 
> data, it's still in the _source. 
>
> But to be honest I wouldn't even store this kind of information in ES, 
> your index is going to be bigger, merges are going to be slower... I'd 
> keep the binary files stored elsewhere. 
>
> Cédric Hourcade 
> [email protected] 
>
>
> On Fri, Jun 20, 2014 at 11:25 AM, Tanguy Bernard 
> <[email protected]> wrote: 
> > Yes, I am applying "reuters" to my document (composed of text and 
> > pictures). 
> > My goal is to search the text of the document for any word or part 
> > of a word. 
> > 
> > Yes, the problem is my nGram filter. 
> > How do I solve this problem? Decrease the nGram max? Replace the 
> > analyzer with another one that still satisfies my goal? 
> > 
> > On Friday, 20 June 2014 at 10:58:49 UTC+2, Cédric Hourcade wrote: 
> >> 
> >> Does it mean you're applying the "reuters" analyzer to your base64 
> >> encoded pictures? 
> >> 
> >> I guess it generates a really huge number of tokens for each entry 
> >> because of your nGram filter (with a max at 250). 
> >> 
> >> Cédric Hourcade 
> >> [email protected] 
> >> 
> >> 
> >> On Fri, Jun 20, 2014 at 9:09 AM, Tanguy Bernard 
> >> <[email protected]> wrote: 
> >> > Information 
> >> > My "note_source" contains pictures (.jpg, .png ...) in base64 and text. 
> >> > 
> >> > For my mapping I have used : 
> >> > "type" => "string" 
> >> > "analyzer" => "reuters" (the name of my analyzer) 
> >> > 
> >> > 
> >> > Any ideas? 
> >> > 
> >> > On Thursday, 19 June 2014 at 17:57:46 UTC+2, Tanguy Bernard wrote: 
> >> >> 
> >> >> Hello, 
> >> >> I have an issue when I index one particular field, "note_source" 
> >> >> (SQL longtext). 
> >> >> I use the same analyzer for every field (except date_source and 
> >> >> id_source), but for "note_source" I get a "WARN monitor.jvm" 
> >> >> message. 
> >> >> When I remove "note_source", everything is fine. If I don't use an 
> >> >> analyzer on "note_source", everything is fine, but if I use my 
> >> >> analyzer on "note_source" I get crashes. 
> >> >> 
> >> >> I think I have enough memory; I have set ES_HEAP_SIZE. 
> >> >> Maybe my problem is with accents (ASCII, UTF-8). 
> >> >> 
> >> >> Can you help me with this? 
> >> >> 
> >> >> 
> >> >> 
> >> >> My Setting 
> >> >> 
> >> >>  public function createSetting($pf){ 
> >> >>         $params = array('index' => $pf, 'body' => array( 
> >> >>         'settings' => array( 
> >> >>             'number_of_shards' => 5, 
> >> >>             'number_of_replicas' => 0, 
> >> >>             'analysis' => array( 
> >> >>                 'filter' => array( 
> >> >>                     'nGram' => array( 
> >> >>                         "token_chars" =>array(), 
> >> >>                         "type" => "nGram", 
> >> >>                         "min_gram" => 3, 
> >> >>                         "max_gram"  => 250 
> >> >>                     ) 
> >> >>                 ), 
> >> >>                 'analyzer' => array( 
> >> >>                     'reuters' => array( 
> >> >>                         'type' => 'custom', 
> >> >>                         'tokenizer' => 'standard', 
> >> >>                         'filter' => array('lowercase', 'asciifolding', 'nGram') 
> >> >>                     ) 
> >> >>                 ) 
> >> >>             ) 
> >> >>         ) 
> >> >>         )); 
> >> >>         $this->elasticsearchClient->indices()->create($params); 
> >> >>         return; 
> >> >> } 
> >> >> 
> >> >> 
> >> >> My Indexing 
> >> >> 
> >> >> public function indexTable($pf,$typeElement){ 
> >> >> 
> >> >>         $params =array( 
> >> >>             "index" =>'_river', 
> >> >>             "type" => $typeElement, 
> >> >>             "id" => "_meta", 
> >> >>             "body" =>array( 
> >> >> 
> >> >>                 "type" => "jdbc", 
> >> >>                 "jdbc" => array( 
> >> >>                     "url" => "jdbc:mysql://ip/name", 
> >> >>                     "user" => 'root', 
> >> >>                     "password" => 'mdp', 
> >> >>                     "index" => $pf, 
> >> >>                     "type" => $typeElement, 
> >> >>                     "sql" => "select id_source as _id, id_sous_theme, titre_source, desc_source, note_source, adresse_source, type_source, date_source from source", 
> >> >>                     "max_bulk_requests" => 5, 
> >> >>                     ) 
> >> >>             ) 
> >> >> 
> >> >>         ); 
> >> >> 
> >> >> 
> >> >>         $this->elasticsearchClient->index($params); 
> >> >> } 
> >> >> 
> >> >> Thanks in advance. 
> >> > 
> >> > -- 
> >> > You received this message because you are subscribed to the Google 
> >> > Groups "elasticsearch" group. 
> >> > To unsubscribe from this group and stop receiving emails from it, 
> >> > send an email to [email protected]. 
> >> > To view this discussion on the web visit 
> >> > https://groups.google.com/d/msgid/elasticsearch/5d93217c-bded-40fa-8fd2-fdac576c57ee%40googlegroups.com.
> >> > For more options, visit https://groups.google.com/d/optout. 