Just store the whole thing in the index. It will make the index bigger, but it will let you find method in the madness: manually parsing a forty-meg file every time you need to display search results is too intensive.
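A minimal sketch of the splitting step, in Python for brevity and using only the standard library (no Lucene calls). It assumes each sub-document starts with a marker such as __10001__, as described in the quoted message; the `Date:` header in the sample data and the names `split_records` and `MARKER` are made up for illustration. Each record carries its number, its byte offset in the big file, and its full text, which are the values one might store per document in the index so the big file never has to be re-parsed at display time.

```python
import re

# Assumed marker format, based on the examples __10001__, __10002__ in the thread.
MARKER = re.compile(rb"__(\d+)__")

def split_records(data: bytes):
    """Yield one dict per sub-document found in the big file's bytes."""
    matches = list(MARKER.finditer(data))
    for i, m in enumerate(matches):
        # A record runs from its marker up to the next marker (or end of file).
        end = matches[i + 1].start() if i + 1 < len(matches) else len(data)
        yield {
            "number": m.group(1).decode("ascii"),
            "offset": m.start(),        # byte offset of the marker in the big file
            "length": end - m.start(),  # record size in bytes, marker included
            "body": data[m.end():end].decode("utf-8", errors="replace").strip(),
        }

# Hypothetical sample data; the real header layout is not given in the thread.
sample = (
    b"__10001__\nDate: 2002-06-12\nFirst message text.\n"
    b"__10002__\nDate: 2002-06-19\nSecond message text.\n"
)

records = list(split_records(sample))
for r in records:
    print(r["number"], r["offset"], r["length"])
```

If the body is stored in the index, the offset and length are only needed as a fallback for seeking into the original file. Note that this sketch would split in the wrong place if a marker-like string such as __12345__ ever appeared inside a document's text.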
Nader Henein

-----Original Message-----
From: Chris Sibert [mailto:[EMAIL PROTECTED]]
Sent: Wednesday, June 19, 2002 10:47 AM
To: Lucene Users List
Subject: Re: Creating indexes

The file that I have is big, about 40 MB. And it's got a whole lot of smaller documents in it, about 15 thousand, too many to separate into individual files. These individual documents are actually similar to emails stored in a large text file. The file is structured to an extent, with a number before each document (ex: __10001__, __10002__, etc.), with the date, etc. Kind of like email headers.

In the Lucene index, it seems like I'll have to:

1) use a DocumentNumbers field to index all of the document numbers,
2) use a Dates field to index the document dates,
3) and use a TextBody field to index all of the document text together.

I'll have to write an InputStreamFilter or something to parse the data as it's coming in to the Lucene IndexWriter, create a new document every time I hit a new number, and parse out the numbers (like __10001__) so I can separate them out into the DocumentNumbers field, the dates into a Dates field, and the text into a TextBody field. It won't be pleasant writing that parser, but...

My other issue at this point is how to then display the documents that relate to the search hits. I have to be able to open that 40 MB file and go to the document(s) that correspond to the hits in the index, for display to the user. Does Lucene keep a location stored in the index of where each word is found in the original file? How do I know at what offset in the original data file to find the original document for display? Is this something that I have to store myself in each document object in the index? Is this why you create separate document objects in the Lucene index, so that each new document object in the index will contain the file offset into the original data file?
And if Lucene doesn't put that file offset in there automagically, I would have to store that myself as I create the index, in something like a FileOffsetLocation field, for each document. Am I on the right track here? Whew.

----- Original Message -----
From: "none none" <[EMAIL PROTECTED]>
To: "Lucene Users List" <[EMAIL PROTECTED]>
Sent: Wednesday, June 12, 2002 11:56 AM
Subject: Re: Creating indexes

> Lucene doesn't know where a file starts or ends. Actually it does know, but in your case one Document contains many smaller documents. If you want to split your big file into small files, you must do that yourself. Take a look at the Document class and you will see that Lucene uses a Reader to index the body of a file, so maybe you should build a class that returns a Reader for each sub-document you want.
> But I think it is easier to split your main document into small documents and index these small documents with a common "keyword" that is the actual big file name, so that when you search you can tell where each "sub" document is located. After you index those files you can delete them. What you need is a BigDocumentManager that:
>
> 1. splits your big file(s)
> 2. indexes them (don't forget the keyword => big doc name)
> 3. deletes those "sub" documents (they are like temp docs).
>
> Hope this helps.
>
> --
>
> On Wed, 12 Jun 2002 02:26:58
> Chris Sibert wrote:
> >I have a big (40 MB or so) file to index. The file contains a whole bunch
> >of documents, which are each pretty small, about a few typewritten pages
> >long. There's a title, date, and author for each document, in addition to
> >the documents' actual text.
> >
> >I'm not quite sure how you index this in Lucene. For each document in the
> >original file, I assume that I create a separate Lucene Document object in
> >the index with author, date, title, and text fields. If so, my question is
> >that when I'm reading in the original file for indexing, does Lucene know
> >where each document begins and ends in the original file?
> >Or do I have to write a parser or filter or something for the InputStream
> >that's reading the file?
> >
> >Chris Sibert
> >
> >--
> >To unsubscribe, e-mail: <mailto:[EMAIL PROTECTED]>
> >For additional commands, e-mail: <mailto:[EMAIL PROTECTED]>
