Hi All,

We are experiencing OOMs (OutOfMemoryErrors) when binary data contained in text files (e.g., a base64 section of a text file) is indexed. We have extensive file-type recognition in place, but we still encounter binary sections embedded inside otherwise normal text files.

We are using the default value of 128 for termIndexInterval. The problem arises because binary data generates a huge number of essentially random tokens, so roughly totalTerms/termIndexInterval terms end up held in memory by the term index. Increasing -Xmx is not viable, as it is already maxed out.
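For reference, the interval is fixed at index-writing time, which is part of why we're stuck. A rough sketch of what raising it would look like (this assumes the Lucene 3.0-era IndexWriter API; the path and the numbers in the comments are just illustrative):

import java.io.File;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;

public class TermIndexIntervalExample {
    public static void main(String[] args) throws Exception {
        // Rough memory math: with ~500M unique terms and the default interval of 128,
        // roughly 500M / 128 ~= 3.9M term entries are held in memory by readers.
        // Raising the interval to 1024 would cut that to ~490K, at the cost of
        // slower term lookups. (Numbers are made up for illustration.)
        IndexWriter writer = new IndexWriter(
                FSDirectory.open(new File("/path/to/index")),   // hypothetical index location
                new StandardAnalyzer(Version.LUCENE_30),
                IndexWriter.MaxFieldLength.UNLIMITED);
        writer.setTermIndexInterval(1024);  // only affects segments written from now on
        // ... add documents ...
        writer.close();
    }
}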

Does anybody know of a better solution to this problem than writing some kind of binary section recognizer/filter?
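For concreteness, the filter we'd otherwise write would look roughly like this (assuming the attribute-based TokenStream API of recent Lucene versions; the "mixed character classes in a long token" heuristic is purely illustrative and would need tuning against real data):

import java.io.IOException;

import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

/** Drops tokens that look like base64/binary junk rather than natural-language words. */
public final class BinaryJunkFilter extends TokenFilter {

    private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);

    public BinaryJunkFilter(TokenStream input) {
        super(input);
    }

    @Override
    public boolean incrementToken() throws IOException {
        while (input.incrementToken()) {
            if (!looksLikeBinaryJunk(termAtt.buffer(), termAtt.length())) {
                return true;   // keep normal tokens
            }
            // otherwise skip the token and pull the next one
        }
        return false;
    }

    private static boolean looksLikeBinaryJunk(char[] buf, int len) {
        if (len < 20) {
            return false;      // short tokens are almost certainly real words
        }
        boolean upper = false, lower = false, digit = false;
        for (int i = 0; i < len; i++) {
            char c = buf[i];
            if (Character.isUpperCase(c)) upper = true;
            else if (Character.isLowerCase(c)) lower = true;
            else if (Character.isDigit(c)) digit = true;
        }
        // base64 runs typically mix several character classes in one long token
        return (upper && lower) || (upper && digit) || (lower && digit);
    }
}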

It appears that termIndexInterval is factored into the stored index and thus cannot be changed dynamically to work around the problem after an index has become polluted. Other than identifying the documents containing binary data, deleting them, and then optimizing the whole index, has anybody found a better way to recover from this problem?
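To be clear, the recovery we have in mind is something like the following (again a Lucene 3.0-era sketch; the binary_suspect field is hypothetical, we would have to identify and mark those documents ourselves first):

import java.io.File;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.Term;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;

public class PurgeBinaryDocs {
    public static void main(String[] args) throws Exception {
        IndexWriter writer = new IndexWriter(
                FSDirectory.open(new File("/path/to/index")),   // hypothetical index location
                new StandardAnalyzer(Version.LUCENE_30),
                IndexWriter.MaxFieldLength.UNLIMITED);

        // Delete the documents we have flagged as containing binary sections.
        writer.deleteDocuments(new Term("binary_suspect", "true"));

        // optimize() rewrites the remaining documents into fresh segments, which is
        // the step that actually reclaims the bloated term index.
        writer.optimize();
        writer.close();
    }
}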

Thanks for any insights or suggestions,

Chuck

