I think the binary section recognizer is probably your best bet.

If you write an analyzer that ignores terms consisting only of hexadecimal digits, as well as terms containing embedded digits, you will probably reduce the pollution quite a bit. It is trivial to write and not too expensive to check.
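Something along these lines might do it, as a rough, untested sketch against the Token-based TokenFilter API of the 2.x line (the class name, and the reading of "embedded digit" as a digit that is neither the first nor the last character of the term, are my own assumptions):

import java.io.IOException;
import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;

/** Drops terms that look like binary/base64 noise: all-hex terms
 *  and terms with digits embedded in their interior. */
public class BinaryNoiseFilter extends TokenFilter {

  public BinaryNoiseFilter(TokenStream input) {
    super(input);
  }

  public Token next() throws IOException {
    for (Token t = input.next(); t != null; t = input.next()) {
      if (!looksLikeBinaryNoise(t.termText())) {
        return t;                       // keep ordinary terms
      }
      // otherwise skip the term and pull the next one
    }
    return null;
  }

  private static boolean looksLikeBinaryNoise(String term) {
    boolean allHex = term.length() > 0;
    boolean embeddedDigit = false;
    for (int i = 0; i < term.length(); i++) {
      char c = term.charAt(i);
      boolean digit = c >= '0' && c <= '9';
      boolean hexLetter = (c >= 'a' && c <= 'f') || (c >= 'A' && c <= 'F');
      if (!digit && !hexLetter) {
        allHex = false;                 // not a pure hex term after all
      }
      if (digit && i > 0 && i < term.length() - 1) {
        embeddedDigit = true;           // digit in the interior of the term
      }
    }
    return allHex || embeddedDigit;
  }
}

You would chain it after your tokenizer in the analyzer. It is deliberately aggressive: real terms with interior digits (part numbers, identifiers) will also be dropped, so tune the test to your content.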


On Nov 6, 2007, at 6:56 PM, Chuck Williams wrote:

Hi All,

We are experiencing OOMs when binary data contained in text files (e.g., a base64 section of a text file) is indexed. We have extensive recognition of file types, but have encountered binary sections inside otherwise normal text files.

We are using the default value of 128 for termIndexInterval. The problem arises because the binary data generates an enormous number of unique random terms, and roughly totalTerms/termIndexInterval of them are held in memory by each reader. Increasing -Xmx is not viable, as it is already maxed out.
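For context on that arithmetic, here is a rough sketch of where the interval is set under the 2.x IndexWriter API (the path is a placeholder and the term counts in the comment are purely illustrative); as noted below, it only applies to segments written after the change:

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;

public class SetTermIndexInterval {
  public static void main(String[] args) throws Exception {
    // With the default interval of 128, an index that has accumulated, say,
    // 500 million unique terms keeps roughly 500M / 128 ~= 3.9M index terms
    // in RAM per reader.  A larger interval trades term-lookup speed for
    // memory, but only for segments written after it is changed.
    IndexWriter writer = new IndexWriter("/path/to/index", new StandardAnalyzer(), false);
    writer.setTermIndexInterval(1024);
    writer.close();
  }
}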

Does anybody know of a better solution to this problem than writing some kind of binary section recognizer/filter?

It appears that termIndexInterval is factored into the stored index and thus cannot be changed dynamically to work around the problem after an index has become polluted. Other than identifying the documents containing binary data, deleting them, and then optimizing the whole index, has anybody found a better way to recover from this problem?
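As a rough sketch of that recovery path, assuming the offending documents can be marked with a hypothetical "binary_suspect" field (an existing polluted index would need some other way of identifying them):

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.Term;

public class PurgeBinaryDocs {
  public static void main(String[] args) throws Exception {
    IndexWriter writer = new IndexWriter("/path/to/index", new StandardAnalyzer(), false);
    // Delete every document previously tagged as containing binary data
    // (the "binary_suspect" field is hypothetical, not something the
    // existing index already has).
    writer.deleteDocuments(new Term("binary_suspect", "true"));
    // Merging down to one segment drops the deleted docs and any terms
    // that no longer occur anywhere, shrinking the in-memory term index.
    writer.optimize();
    writer.close();
  }
}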

Thanks for any insights or suggestions,

Chuck

