Yes, I did not know how nGram works!
I found a perfect solution for my picture (base64) problem: use
*'char_filter' => array('html_strip')* in the analyzer:
public function createSetting($pf){
    $params = array('index' => $pf, 'body' => array(
        'settings' => array(
            'number_of_shards' => 5,
            'number_of_replicas' => 0,
            'analysis' => array(
                'filter' => array(
                    'MYnGram' => array(
                        // token_chars is documented for the nGram tokenizer, not the
                        // nGram token filter; kept here as in my original setup
                        "token_chars" => array(),
                        "type" => "nGram",
                        "min_gram" => 3,
                        "max_gram" => 20
                    )
                ),
                'analyzer' => array(
                    'reuters' => array(
                        'type' => 'custom',
                        'tokenizer' => 'standard',
                        'filter' => array('lowercase', 'asciifolding', 'MYnGram'),
                        // html_strip removes HTML markup before tokenization
                        'char_filter' => array('html_strip'),
                    ),
                )
            )
        )
    ));
    // Create the index with the settings above
    $this->elasticsearchClient->indices()->create($params);
}
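
To get a feeling for why the min_gram/max_gram range matters so much, here is a rough sketch that counts the n-grams produced from a single token. countNgrams is a hypothetical helper written for illustration, not part of Elasticsearch or its PHP client, and it assumes the filter emits every substring of each size between min_gram and max_gram.

```php
<?php
// Hypothetical helper: number of n-grams emitted for ONE token of $length
// characters, assuming every substring of size min..max is produced.
function countNgrams(int $length, int $minGram, int $maxGram): int
{
    $count = 0;
    for ($n = $minGram; $n <= min($maxGram, $length); $n++) {
        // A token of $length characters contains ($length - $n + 1)
        // substrings of size $n.
        $count += $length - $n + 1;
    }
    return $count;
}

// A 100-character chunk of base64 under different settings:
echo countNgrams(100, 3, 3), "\n";    // 98   (min_gram = max_gram = 3)
echo countNgrams(100, 3, 20), "\n";   // 1611 (the settings used above)
echo countNgrams(100, 3, 250), "\n";  // 4851 (the 3..250 range from the quoted reply)
```

An ordinary 8-letter word gives only 21 n-grams with the 3..20 range, so long base64 chunks are what make the index explode.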
Thanks to all of you!
On Saturday, 21 June 2014 00:35:39 UTC+2, Clinton Gormley wrote:
>
> You seriously don't want 3..250 length ngrams!!!! That's ENORMOUS
>
> Typically set min/max to 3 or 4, and that's it
>
>
> http://www.elasticsearch.org/guide/en/elasticsearch/guide/current/_ngrams_for_partial_matching.html#_ngrams_for_partial_matching
>
>
> On 20 June 2014 16:05, Tanguy Bernard <[email protected]> wrote:
>
>> Thank you, Cédric Hourcade!
>>
>> On Friday, 20 June 2014 15:32:29 UTC+2, Cédric Hourcade wrote:
>>
>>> If your base64-encoded strings are long, they are going to be split into
>>> a lot of tokens by the standard tokenizer.
>>>
>>> These tokens are often going to be a lot longer than standard words,
>>> so your nGram filter will generate even more tokens, a lot more than
>>> with standard text. That may be your problem there.
>>>
>>> You should really try to strip the encoded images with a simple regex
>>> from your documents before indexing them. If you need to keep the
>>> source, put the raw text in an unindexed field, and the cleaned one in
>>> another.
>>>
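
The regex approach Cédric describes above could be sketched roughly like this. It is only an illustration: stripBase64Images and the field names content_raw and content are hypothetical, and the pattern assumes the images are embedded as data:image/...;base64,... URIs.

```php
<?php
// Hypothetical helper: removes inline base64 images (data URIs) from a text.
function stripBase64Images($text)
{
    // Matches e.g. data:image/png;base64,iVBORw0KGgo=
    $pattern = '#data:image/[a-zA-Z0-9.+-]+;base64,[A-Za-z0-9+/=]+#';
    return preg_replace($pattern, '', $text);
}

$html = '<p>Hello</p><img src="data:image/png;base64,iVBORw0KGgo="><p>world</p>';

// Hypothetical document layout: raw source in one field, cleaned text in another.
$doc = array(
    'content_raw' => $html,                    // keep the source; map this field as not indexed
    'content'     => stripBase64Images($html), // cleaned text that goes through the analyzer
);

echo $doc['content'], "\n";
// <p>Hello</p><img src=""><p>world</p>
```

The mapping that marks content_raw as unindexed is left out here on purpose, since its syntax depends on the Elasticsearch version.
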
>> --
>> You received this message because you are subscribed to the Google Groups
>> "elasticsearch" group.
>> To unsubscribe from this group and stop receiving emails from it, send an
>> email to [email protected].
>> To view this discussion on the web visit
>> https://groups.google.com/d/msgid/elasticsearch/b62f4e12-1b54-4621-986a-93411404f7af%40googlegroups.com.
>>
>> For more options, visit https://groups.google.com/d/optout.
>>
>
>