Thanks.

----- Original Message -----
From: "Nader S. Henein" <[EMAIL PROTECTED]>
To: "Lucene Users List" <[EMAIL PROTECTED]>
Sent: Wednesday, June 19, 2002 2:54 AM
Subject: RE: Creating indexes
> Just store the whole thing in the index. It'll make the index bigger,
> but then it'll allow you to find method in madness; manually parsing a
> forty meg file every time you need to display search results is too
> intensive.
>
> Nader Henein
>
> -----Original Message-----
> From: Chris Sibert [mailto:[EMAIL PROTECTED]]
> Sent: Wednesday, June 19, 2002 10:47 AM
> To: Lucene Users List
> Subject: Re: Creating indexes
>
>
> The file that I have is big, about 40 MB. And it's got a whole lot of
> smaller documents in it - about 15 thousand - too many to separate into
> individual files. These individual documents are actually similar to emails
> stored in a large text file. The file is structured to an extent, with a
> number before each document (ex: __10001__, __10002__, etc.), with the
> date, etc. Kind of like email headers.
>
> In the Lucene index, it seems like I'll have to use: 1) a DocumentNumbers
> field to index all of the document numbers, 2) a Dates field to index the
> document dates, and 3) a TextBody field to index all of the document text
> together. I'll have to write an InputStreamFilter or something to parse the
> data as it's coming in to the Lucene IndexWriter, create a new document
> every time I hit a new number, and parse out the numbers - like __10001__ -
> so I can separate them out in the DocumentNumbers field, the dates into a
> Dates field, and the text in a TextBody field. It won't be pleasant writing
> that parser, but...
>
> My other issue at this point is how to then display the documents that
> relate to the search hits. I have to be able to open that 40 MB file and go
> to the document(s) that correspond to the hits in the index, for display to
> the user. Does Lucene keep a location stored in the index of where each word
> is found in the original file? How do I know at what point in the original
> data file to find the offset to display the original document?
> Is this something that I have to store myself in each document object in
> the index? Is this why you create separate document objects in the Lucene
> index - each new document object in the index will contain the file offset
> to the original data file? And if Lucene doesn't put that file offset in
> there automagically, I would have to store that myself as I create the
> index, in something like a FileOffsetLocation field, for each document.
> Am I on the right track here?
>
> Whew.
>
> ----- Original Message -----
> From: "none none" <[EMAIL PROTECTED]>
> To: "Lucene Users List" <[EMAIL PROTECTED]>
> Sent: Wednesday, June 12, 2002 11:56 AM
> Subject: Re: Creating indexes
>
>
> > Lucene doesn't know where a file starts or ends. Actually it knows, but
> > in your case one Document contains many small documents. If you want to
> > split your big file into small files you must do that yourself. Take a
> > look at the Document class and you will see that Lucene uses a Reader to
> > index the body of a file, so maybe you should build a class that returns
> > a Reader for each sub-document you want.
> > But I think it is easier to split your main document into small
> > documents and index these small documents with a common "keyword" that
> > is the actual big file name, so when you search you can tell where each
> > "sub" document is located. After you index those files you can delete
> > them. What you need is a BigDocumentManager that:
> >
> > 1. splits your big file/s
> > 2. indexes them (don't forget the keyword => big doc name)
> > 3. deletes those "sub" documents (they are like temp docs)
> >
> > Hope this helps.
> >
> >
> > --
> >
> > On Wed, 12 Jun 2002 02:26:58
> > Chris Sibert wrote:
> > >I have a big (40 MB or so) file to index. The file contains a whole bunch
> > >of documents, which are each pretty small, about a few typewritten pages
> > >long. There's a title, date, and author for each document, in addition to
> > >the documents' actual text.
> > >
> > >I'm not quite sure how you index this in Lucene. For each document in the
> > >original file, I assume that I create a separate Lucene Document object in
> > >the index with author, date, title, and text fields. If so, my question is
> > >that when I'm reading in the original file for indexing, does Lucene know
> > >where each document begins and ends in the original file? Or do I have to
> > >write a parser or filter or something for the InputStream that's reading
> > >the file?
> > >
> > >Chris Sibert
> > >
> > >--
> > >To unsubscribe, e-mail: <mailto:[EMAIL PROTECTED]>
> > >For additional commands, e-mail: <mailto:[EMAIL PROTECTED]>
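
For illustration, the parser discussed in this thread could look roughly like the Java sketch below. It is only a sketch under stated assumptions, not anyone's actual implementation: the class and method names (BigFileSplitter, Record, split, extract) are hypothetical, the marker format is guessed from the __10001__ examples in the thread, and the sample data in main is made up. It works on an in-memory byte array for brevity (40 MB fits comfortably); against the real file, the same offsets could be used with RandomAccessFile.seek.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: splits one big file into sub-documents and
// remembers where each one lives so it can be fetched again later.
class BigFileSplitter {

    /** One sub-document: its number plus the location of its text. */
    static final class Record {
        final String number; // e.g. "10001"
        final int offset;    // byte offset of the text after the marker
        final int length;    // byte length of that text

        Record(String number, int offset, int length) {
            this.number = number;
            this.offset = offset;
            this.length = length;
        }
    }

    // Assumed marker format, based on the examples in the thread.
    // Note: a body that itself contains "__123__" would be split too.
    private static final Pattern MARKER = Pattern.compile("__(\\d+)__");

    /** Scans the data once and returns one Record per marker found. */
    static List<Record> split(byte[] data) {
        // ISO-8859-1 maps every byte to exactly one char, so the char
        // indexes reported by the matcher are also byte offsets.
        String text = new String(data, StandardCharsets.ISO_8859_1);
        List<Record> records = new ArrayList<>();
        Matcher m = MARKER.matcher(text);
        String number = null;
        int start = 0;
        while (m.find()) {
            if (number != null) {
                // The previous sub-document ends where this marker begins.
                records.add(new Record(number, start, m.start() - start));
            }
            number = m.group(1);
            start = m.end();
        }
        if (number != null) {
            // The last sub-document runs to the end of the data.
            records.add(new Record(number, start, text.length() - start));
        }
        return records;
    }

    /** Reads one sub-document back using its stored offset and length. */
    static String extract(byte[] data, Record r) {
        // Assumes single-byte text; adjust the charset for other encodings.
        return new String(data, r.offset, r.length,
                StandardCharsets.ISO_8859_1).trim();
    }

    public static void main(String[] args) {
        // Made-up sample data in the assumed format.
        byte[] data = ("__10001__\nDate: 2002-06-12\nFirst body.\n"
                + "__10002__\nDate: 2002-06-19\nSecond body.\n")
                .getBytes(StandardCharsets.ISO_8859_1);
        for (Record r : split(data)) {
            System.out.println(r.number + " @ " + r.offset + ": "
                    + extract(data, r).replace('\n', ' '));
        }
    }
}
```

Each Record would then become one Lucene Document. Either store the number, offset, and length as fields and read the text back from the big file at display time, or, as Nader suggests, store the whole text in the index and skip the offsets entirely.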
