Re: Special characters prevent entity being indexed

Pekka Nykyri Wed, 19 Nov 2008 03:26:42 -0800


Thanks for the quick answer!

I haven't specified the analyzer, so it should be the StandardAnalyzer. I forgot to mention that I'm using Lucene via Hibernate Search, where I can easily define the fields in the Hibernate POJO classes. But as far as I know this shouldn't change things that much, because I can still use core Lucene.

And I've used Luke already, and the indexed special characters are represented as "¡" (¡) and "¿" (¿) in the index.

But the analyzer should have nothing to do with the problem at the moment, because the problem is that those entities that start with "¿" don't get indexed at all. And some of those starting with "¡" get indexed and some don't. Currently 29 entities (out of 8900 in total) don't get indexed at all.

I don't need to be able to search for those special characters. I just need those entities to get indexed. The other information in those entities is more important, and it's the names (starting with those special characters) that seem to keep those entities from being indexed.

Could I fix this using some analyzer during indexing? Actually I tried using a custom analyzer with "ISOLatin1AccentFilter()" but it didn't change anything. In Hibernate Search the analyzer is specified in a property file or in the POJO classes, but I didn't seem to get it to work. The text went to the index exactly the same way as before (when I look at it with Luke) and the same entities were still missing.

A good solution for me would be for those special characters to be deleted altogether from the index, so that maybe they wouldn't cause any trouble. For example, "¡Fantástico!- blaaba" would be perfectly okay looking like "Fantastico- blaaba".
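
To make clear what I mean, here is a minimal sketch of that clean-up, assuming it would be applied to the string before it goes to the index. It uses only the plain JDK (java.text.Normalizer), not any Lucene or Hibernate Search API, and the class and method names are just made up for illustration. I haven't checked whether this fixes the missing entities.

```java
import java.text.Normalizer;

// Hypothetical helper, not part of Lucene or Hibernate Search.
final class NameCleaner {

    // Strips accents and the (inverted) exclamation and question marks.
    static String clean(String name) {
        // Decompose accented letters into base letter + combining mark (NFD).
        String decomposed = Normalizer.normalize(name, Normalizer.Form.NFD);
        // Drop the combining marks, so the accented "a" becomes a plain "a".
        String unaccented = decomposed.replaceAll("\\p{M}", "");
        // Remove inverted marks (U+00A1, U+00BF) and the ordinary ! and ?.
        return unaccented.replaceAll("[\u00A1\u00BF!?]", "").trim();
    }

    public static void main(String[] args) {
        // Input is the example name, written with Unicode escapes.
        // prints: Fantastico- blaaba
        System.out.println(clean("\u00A1Fant\u00E1stico!- blaaba"));
    }
}
```

Doing this on the value itself would mean the stored name changes too, so it might be better done in an analyzer if I can get one to work.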


Thanks again in advance,
pn

On Tue, 18 Nov 2008, Erick Erickson wrote:

What analyzer are you using at index and search time? Typical problems
include:
using an analyzer that doesn't understand accented chars (StandardAnalyzer
for instance)
using a different analyzer during search and index.

Search the user list for "accent" and you'll find this kind of problem
discussed, and if that doesn't help we need to know what analyzers you are
using and what behavior you really want. Typically, for instance, *requiring*
a user to type the upside-down exclamation point to get a match on this field
would be considered incorrect.

Also, you'd be helped a lot by getting a copy of Luke and examining your
index to see exactly what's been indexed; it'll reveal a lot.

Best
Erick

On Tue, Nov 18, 2008 at 10:05 AM, Pekka Nykyri <[EMAIL PROTECTED]> wrote:

Hi!

I'm having problems with entities including special characters (Spanish
language) not getting indexed.

I haven't been able to find the reason why some entities get indexed
while some don't.

I have 3 fields that (currently) hold the same value. The value for the
fields is, for example, "¡Fantástico!- blaaba". Then when I change ONE of the
three values to "¡Fantástico! - blaaba", the entity gets indexed. So
changing only one field makes it index.

But the bigger problem with this is that I have an almost identical entity
(the other fields are almost the same and I don't think they cause the
problem), with exactly the same three "¡Fantástico!- blaaba" fields, and it
gets indexed normally, even though the "critical" fields are exactly the same.

And also, all entities where the three fields start with the "upside down ?"
mark don't get indexed.

I'm really confused by the problem because I don't seem to be able to
find any logic in why some entities are not indexed even though they are
similar to others. And changing only one value of the three makes it index.

Sorry for a really messy message but I just can't explain it more clearly
now.

Thanks in advance,
pn

---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]

