Hello,

> Ard,
> 
> 
> > > By coincidence I discovered that the xml file contains leading 
> > > binary characters (ff fe) and that it as a whole is seen as binary 
> > > by my text editor. So perhaps this is causing the duplicate results.
> 
> I came across this link: 
> http://www.25hoursaday.com/weblog/2005/10/18/TheMythOfTheOfficeXMLBinaryKey.aspx
> 
> 
> 
> It mentions the ff fe bytes (which indicate little-endian byte 
> order) that I see at the beginning of my document.
> 
> The xml files start with the declaration 
> <?xml version="1.0" encoding="utf-16"?>, which specifies the encoding.
> 
> > > I'll try to get them removed and see whether the issue is resolved.
> > 
> > If you could do a test with this, it would give me some pointers indeed...
> 
> When I manually overwrite a document (leaving out the two bytes 
> and also the encoding) the index is 'repaired' and only 
> one hit is found with a search. It looks like the leading 
> bytes and the encoding are causing the unexpected search results.

Wow, I must admit I learned something new today :-) Great research Æde, I would 
not have guessed this off the top of my head. I also know Lucene trunk has 
some parts that make use of \uffff-style special characters, so I am wondering 
whether this might cause collisions similar to what you encountered.

Is it possible for you to store the documents as utf-8?
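
In case it helps, here is a minimal sketch of such a conversion in Python. It 
is only an illustration, not something the CMS does itself: the function name 
utf16_xml_to_utf8 is made up, and it assumes the input starts with a byte 
order mark (such as the ff fe you found) and that the XML declaration uses 
double quotes around the encoding value.

```python
import codecs
import re


def utf16_xml_to_utf8(raw: bytes) -> bytes:
    """Re-encode UTF-16 XML bytes as UTF-8 without a byte order mark.

    Hypothetical helper for illustration; assumes `raw` starts with a BOM.
    """
    # The "utf-16" codec reads the leading BOM (ff fe or fe ff), uses it to
    # pick the byte order, and drops it from the decoded text.
    text = raw.decode("utf-16")
    # Update the encoding named in the XML declaration so that it matches
    # the bytes we are about to write.
    text = re.sub(
        r'(<\?xml[^>]*encoding=")[^"]*(")',
        r"\g<1>utf-8\g<2>",
        text,
        count=1,
    )
    # Plain "utf-8" (unlike "utf-8-sig") does not write a BOM.
    return text.encode("utf-8")


# Example input: the ff fe BOM followed by little-endian UTF-16 content.
sample = codecs.BOM_UTF16_LE + (
    '<?xml version="1.0" encoding="utf-16"?><doc>Æde</doc>'.encode("utf-16-le")
)
converted = utf16_xml_to_utf8(sample)
```

The result has no leading ff fe bytes and declares utf-8, so it should be 
comparable to the document you overwrote by hand.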

Regards Ard

> 
> 
> --Æde
> 
> 
> /********************************************
> Hippocms-dev: Hippo CMS development public mailinglist
> 
> Searchable archives can be found at:
> MarkMail: http://hippocms-dev.markmail.org
> Nabble: http://www.nabble.com/Hippo-CMS-f26633.html
> 
> 