indexing big files

Jonathan_Wasson Tue, 08 Jan 2002 14:08:40 -0800

Question from a Lucene newbie... I'm trying to index a file structure which
happens to include a relatively large file (310 KB, about 55,700 words), and
for some reason it appears to be hanging the whole indexing process.  Here's
a quick rundown:


1) I'm using a web crawler to retrieve files and copy them to my local disk.
2) For files like PDFs, I'm copying an .html equivalent of the file to
my disk (but leaving the .pdf extension).
3) Then later, in a separate batch process, I run pretty much the standard
out-of-the-box "org.apache.lucene.IndexHTML" demo class (except I've added
.pdf as a possible indexing type).

That's about it.  No big deal.  The transformation from PDF to HTML is not
perfected yet either, so nonsense terms are being included in these files;
file size will definitely drop in the future.  But for now... what
should I be looking at or altering to find out what is causing the hang?
Thanks!

Jon Wasson
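
One way to narrow down a hang like the one described above is to run the
per-file indexing step under a time limit and log which file exceeds it.
The following is only a minimal sketch under stated assumptions: the class
name HangFinder, the method findSlowFiles, and the indexOne callback are
hypothetical names invented for illustration, and the callback stands in
for whatever code adds one file to the index. No Lucene calls are shown,
and the stand-in in main just reads each file.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.*;
import java.util.*;
import java.util.concurrent.*;
import java.util.function.Consumer;
import java.util.stream.*;

// Hypothetical helper: not part of Lucene or its demo classes.
class HangFinder {

    // Runs indexOne on each file with a time limit and returns the files
    // that did not finish within timeoutMillis.
    static List<Path> findSlowFiles(List<Path> files, Consumer<Path> indexOne,
                                    long timeoutMillis) {
        List<Path> slow = new ArrayList<>();
        for (Path file : files) {
            // One daemon worker per file, so a task that never returns
            // cannot block later files or keep the JVM alive.
            ExecutorService worker = Executors.newSingleThreadExecutor(r -> {
                Thread t = new Thread(r, "index-" + file.getFileName());
                t.setDaemon(true);
                return t;
            });
            long start = System.nanoTime();
            Future<?> task = worker.submit(() -> indexOne.accept(file));
            try {
                task.get(timeoutMillis, TimeUnit.MILLISECONDS);
                long ms = (System.nanoTime() - start) / 1_000_000;
                System.out.printf("ok      %s (%d ms)%n", file, ms);
            } catch (TimeoutException e) {
                System.out.printf("TIMEOUT %s (> %d ms)%n", file, timeoutMillis);
                slow.add(file);
                task.cancel(true); // only interrupts; the task may ignore it
            } catch (ExecutionException e) {
                System.out.printf("FAILED  %s: %s%n", file, e.getCause());
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                break;
            } finally {
                worker.shutdownNow();
            }
        }
        return slow;
    }

    public static void main(String[] args) throws IOException {
        Path root = Paths.get(args.length > 0 ? args[0] : ".");
        List<Path> files;
        try (Stream<Path> walk = Files.walk(root)) {
            files = walk.filter(Files::isRegularFile).collect(Collectors.toList());
        }
        // Stand-in for the real per-file indexing step (an assumption):
        // here each file is simply read into memory.
        Consumer<Path> indexOne = file -> {
            try {
                Files.readAllBytes(file);
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
        };
        System.out.println("Over the limit: " + findSlowFiles(files, indexOne, 30_000));
    }
}
```

If a single file shows up as TIMEOUT, that file (or the parser handling
it) is the place to look first. A thread dump of the stuck JVM will also
show where the indexing thread is sitting.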

