Bruce Altner wrote: > This is a good lead but it prompts me to ask this: if tools like > openoffice and others (like Acrobat Distiller) know how to reformat > Excel, PowerPoint, and Word it means that the data formats of these > files, as streams, must be public knowledge. If so, where do you get > this information? I would use it to try to build my own parser with > JavaCC if I knew what bytes in the the input stream I needed to tokenize.
I'm not sure what JavaCC has to do with word files. > > Obviously it's not that easy, otherwise we wouldn't need projects like > POI. But is this correct, at least conceptually? POI is a bit more ambitious than that. We look to create full APIs for read/write and modify, not just convert to ASCII. > > Disclaimer: I'm new to JavaCC but the topic interests me so I'm > willing to put in the time to learn it. It would be easier to take Ryan's early code for POI::HDF and create a parser from it. Once HDF is in Beta or so I plan to do so. -Andy > > Bruce > > At 10:50 AM 5/30/2002 -0500, you wrote: > >> This might be worth looking into for those who need to parse word, >> excel, >> powerpoint, or other MS file types of microsofts. >> >> openoffice - www.openoffice.org knows how to parse all of the microsoft >> formats (at least all that I've tried so far) - and then, you can a do a >> save as, and write out the open office format, which is a couple of xml >> files zipped together. So, this makes me think of two possible ways >> that >> you could get at the content of the MS files in a text form you can >> index >> (neither of which I have tried or even looked to see if they are >> possible) >> >> #1 - get the code for openoffice - it is open source - and use it for >> parsing the MS documents into xml which could then be indexed >> >> #2 - if open office is programmatically drivable (which I don't know >> if it >> is), fire up a copy of open office and use it to convert the files as >> necessary. >> >> Just some suggestions. Does anyone know much more about openoffice? I >> would be interested in knowing if either of these would be feasible. >> >> Dan >> >> >> >> >> -----Original Message----- >> From: Ewout Prangsma [mailto:[EMAIL PROTECTED]] >> Sent: Wednesday, May 29, 2002 1:00 PM >> To: Lucene Users List >> Subject: Re: MS Word Search ?? >> >> >> Op Wednesday 29 May 2002 11:56, Karl �ie schreef: >> > b: convert the documents to something that is accessable through >> java like >> > xml, etc... >> >> We're using wvWare (wvware.com) to convert word to html (or text) and >> index >> that and xpdf for converting PDF to text and index that. Any links on >> indexing using POI converters (or other java converters) are very >> welcome! >> >> Ewout >> >> > >> > the best way is to convert as the java api's for MSOffice documents >> still >> > are under development >> > >> > mvh karl �ie >> > >> > On Wednesday 29 May 2002 11:48, Rama Krishna wrote: >> > > Hi, >> > > >> > > I am trying to build a search engine which search in MS Word, >> excel, ppt >> > > and adobe pdf. I am not sure whether i can use Lucene for this or >> not. >> > > pl. help me out in this regard. >> > > >> > > >> > > Regards, >> > > Ramakrishna >> > > >> > > >> > > _________________________________________________________________ >> > > Chat with friends online, try MSN Messenger: >> http://messenger.msn.com >> >> -- >> Ewout Prangsma, Directeur >> Daisy Software >> Telefoon/fax: +31-77-3270305/3270306 >> Email: [EMAIL PROTECTED] >> Website: www.daisysoftware.com >> KvK Venlo nr. 12046144 >> >> >> >> >> -- >> To unsubscribe, e-mail: >> <mailto:[EMAIL PROTECTED]> >> For additional commands, e-mail: >> <mailto:[EMAIL PROTECTED]> >> >> -- >> To unsubscribe, e-mail: >> <mailto:[EMAIL PROTECTED]> >> For additional commands, e-mail: >> <mailto:[EMAIL PROTECTED]> > > > > > -- > To unsubscribe, e-mail: > <mailto:[EMAIL PROTECTED]> > For additional commands, e-mail: > <mailto:[EMAIL PROTECTED]> > > -- To unsubscribe, e-mail: <mailto:[EMAIL PROTECTED]> For additional commands, e-mail: <mailto:[EMAIL PROTECTED]>
