Classes for indexing PDF and Word files in Lucene.

Ernesto.

----- Original Message -----
From: "Ernesto De Santis" <[EMAIL PROTECTED]>
To: <[EMAIL PROTECTED]>
Sent: Wednesday, October 29, 2003 12:04 PM
Subject: Re: [opencms-dev] Index pdf files with your content in lucene.
Hello all,

Thanks very much, Stephan, for your valuable help.
Attached you will find the PDFDocument and WordDocument class source code.

Ernesto.

----- Original Message -----
From: "Hartmann, Waehrisch & Feykes GmbH" <[EMAIL PROTECTED]>
To: <[EMAIL PROTECTED]>
Sent: Tuesday, October 28, 2003 11:10 AM
Subject: Re: [opencms-dev] Index pdf files with your content in lucene.

> Hi Ernesto,
>
> the IndexManager retrieves the list of files in a folder by calling the method
> getFilesInFolder of CmsObject. This method returns only empty files, i.e.
> files with empty content. To get the content of a pdf file you have to reread
> the file:
> f = cms.readFile(f.getAbsolutePath());
>
> Bye,
> Stephan
>
> On Monday, 27 October 2003 at 19:18 you wrote:
> >
> > Hello
> >
> > Thanks for the previous reply.
> >
> > Now I use:
> > - version 1.4 of the Lucene search module (the version attached in this list)
> > - the new version of the registry.xml format for the module (as you wrote me)
> > - the pdf files are stored with the binary type.
> >
> > But I have the following problem:
> > I can't make an InputStream for the CmsFile content.
> > For this I wrote this code in the Document method of my class PDFDocument:
> >
> > -----------------
> >
> > InputStream in = new ByteArrayInputStream(f.getContents()); // f is the CmsFile parameter of the Document method
> >
> > PDFExtractor extractor = new PDFExtractor(); // PDFExtractor is the library I use; on the file system it works fine.
> >
> > bodyText = extractor.extractText(in);
> >
> > ----------------
> >
> > Is it correct to use ByteArrayInputStream to make an InputStream for a CmsFile?
> >
> > The error occurs in the third line, in the PDFParcer.
> > The error message in Tomcat is:
> >
> > java.io.IOException: Error: Header is corrupt ''
> > at PDFParcer.parse
> > at PDFExtractor.extractText
> > at PDFDocument.Document (my class)
> > at.....
> >
> > Bye, and thanks.
> > Ernesto.
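A minimal sketch of why empty content produces the "Header is corrupt ''" error, and of a cheap check that could be run before handing the bytes to the extractor. It does not use the OpenCms API: plain byte arrays stand in for what f.getContents() would return before and after the reread with cms.readFile(...). The class and method names (PdfHeaderCheck, looksLikePdf) are hypothetical, chosen only for illustration.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class PdfHeaderCheck {

    // Every PDF file starts with this marker, followed by the version number.
    private static final byte[] PDF_MAGIC = "%PDF-".getBytes(StandardCharsets.US_ASCII);

    // Returns true when the content starts with the "%PDF-" marker.
    // Wrapping the raw bytes in a ByteArrayInputStream is lossless,
    // so this is also a valid way to feed an extractor.
    public static boolean looksLikePdf(byte[] contents) {
        ByteArrayInputStream in = new ByteArrayInputStream(contents);
        byte[] head = new byte[PDF_MAGIC.length];
        int n = in.read(head, 0, head.length); // -1 when the content is empty
        return n == head.length && Arrays.equals(head, PDF_MAGIC);
    }

    public static void main(String[] args) {
        // Stand-in for a file taken from a folder listing: empty content.
        byte[] fromListing = new byte[0];
        // Stand-in for the same file after it has been reread.
        byte[] afterReread = "%PDF-1.4\n".getBytes(StandardCharsets.US_ASCII);

        System.out.println(looksLikePdf(fromListing)); // prints false
        System.out.println(looksLikePdf(afterReread)); // prints true
    }
}
```

With a check like this, an empty header points at the file not having been reread rather than at the choice of InputStream.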
> >
> >
> > ----- Original Message -----
> > From: Hartmann, Waehrisch & Feykes GmbH
> > To: [EMAIL PROTECTED]
> > Sent: Friday, October 24, 2003 4:45 AM
> > Subject: Re: [opencms-dev] Index pdf files with your content in lucene.
> >
> >
> > Hello Ernesto,
> >
> > I assume you are using the unpatched version 1.3 of the search module.
> > As I mentioned yesterday, the plainDocFactory only indexes cmsFiles of
> > type "plain", not of type "binary". PDF files are stored as binary. I
> > suggest using the version I posted yesterday. Your registry.xml would
> > then have to look like this:
> > ...
> > <docFactories>
> >   ...
> >   <docFactory type="plain" enabled="true">
> >     ...
> >   </docFactory>
> >   <docFactory type="binary" enabled="true">
> >     <fileType name="pdftext">
> >       <extension>.pdf</extension>
> >       <class>net.grcomputing.opencms.search.lucene.PDFDocument</class>
> >     </fileType>
> >   </docFactory>
> >   ...
> > </docFactories>
> >
> > Important: The type attribute must match the file types of OpenCms (also
> > defined in the registry.xml).
> >
> > Bye,
> > Stephan
> >
> > ----- Original Message -----
> > From: Ernesto De Santis
> > To: Lucene Users List
> > Cc: [EMAIL PROTECTED]
> > Sent: Thursday, October 23, 2003 4:16 PM
> > Subject: [opencms-dev] Index pdf files with your content in lucene.
> >
> >
> > Hello
> >
> > I am new to OpenCms and Lucene technology.
> >
> > I want to index pdf files, including the content of these files.
> >
> > I work in this way:
> >
> > I made a PDFDocument class like the JspDocument class.
> > It uses the org.textmining.text.extraction.PDFExtractor class; this class
> > works fine outside the VFS.
> >
> > And I wrote my registry.xml entry for pdf documents, in the plainDocFactory tag.
> >
> > <fileType name="pdftext">
> >   <extension>.pdf</extension>
> >   <!-- This will strip tags before processing -->
> >   <class>net.grcomputing.opencms.search.lucene.PDFDocument</class>
> > </fileType>
> >
> > My PDFDocument contains the code below.
> > I think the problem is how to take the content from the CmsFile. Which
> > InputStream should I use? PDFExtractor works with the extractText(InputStream) method.
> >
> > public class PDFDocument implements I_DocumentConstants, I_DocumentFactory {
> >
> >     public PDFDocument() {
> >     }
> >
> >     public Document Document(CmsObject cmsobject, CmsFile cmsfile)
> >             throws CmsException
> >     {
> >         return Document(cmsobject, cmsfile, null);
> >     }
> >
> >     public Document Document(CmsObject cmsobject, CmsFile cmsfile, HashMap hashmap)
> >             throws CmsException
> >     {
> >         Document document = (new BodylessDocument()).Document(cmsobject, cmsfile);
> >
> >         // put the content of the pdf file into the document.
> >         String contenido = new String(cmsfile.getContents());
> >         StringBufferInputStream in = new StringBufferInputStream(contenido);
> >         // ByteArrayInputStream in = new ByteArrayInputStream(contenido.getBytes());
> >
> >         /* try{
> >         FileInputStream in = new FileInputStream (cmsfile.getPath() + cmsfile.getName());
> >         */
> >         PDFExtractor extractor = new PDFExtractor();
> >         String body = extractor.extractText(in);
> >
> >         document.add(Field.Text("body", body));
> >
> >         /* }catch(FileNotFoundException e){
> >             e.toString();
> >             throw new CmsException();
> >         }
> >         */
> >         return (document);
> >     }
> >
> >
> > Thanks
> > Ernesto
> > PS: Sorry for my poor English.
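Separately from the empty-content issue, the code above passes the binary file through a String before building the stream. The sketch below illustrates why that detour is risky for binary data in general. It is not a diagnosis of the error reported in this thread. The original uses new String(bytes) with the platform default charset; UTF-8 is assumed here only to make the result reproducible, and the class and method names (BinaryRoundTrip, viaString, viaStream) are hypothetical. Note also that java.io.StringBufferInputStream is deprecated and uses only the low eight bits of each character.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class BinaryRoundTrip {

    // Decodes the bytes as text and encodes them again, as happens when
    // binary content is turned into a String on its way to a stream.
    // Assumption: UTF-8 stands in for the platform default charset.
    public static byte[] viaString(byte[] raw) {
        String text = new String(raw, StandardCharsets.UTF_8);
        return text.getBytes(StandardCharsets.UTF_8);
    }

    // Reads the bytes back through a ByteArrayInputStream, with no text
    // decoding involved.
    public static byte[] viaStream(byte[] raw) {
        ByteArrayInputStream in = new ByteArrayInputStream(raw);
        byte[] out = new byte[raw.length];
        in.read(out, 0, out.length); // a ByteArrayInputStream delivers all bytes at once
        return out;
    }

    public static void main(String[] args) {
        // A PDF-like prefix followed by bytes that are not valid UTF-8 text.
        byte[] raw = {'%', 'P', 'D', 'F', '-', '1', '.', '4', '\n',
                      (byte) 0xE2, (byte) 0xE3, (byte) 0xCF, (byte) 0xD3};

        System.out.println(Arrays.equals(raw, viaString(raw))); // prints false
        System.out.println(Arrays.equals(raw, viaStream(raw))); // prints true
    }
}
```

The byte-based stream keeps the content intact, which is why wrapping the result of getContents() directly, as in the later message in this thread, is the safer choice.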
> >
> >
> > ----- Original Message -----
> > From: "Hartmann, Waehrisch & Feykes GmbH" <[EMAIL PROTECTED]>
> > To: <[EMAIL PROTECTED]>
> > Sent: Wednesday, October 22, 2003 3:50 AM
> > Subject: Re: [opencms-dev] (no subject)
> >
> > > Hi Ben,
> > >
> > > I think this won't work since the plainDocFactory will only be used
> > > for files of type "plain" but not for files of type "binary".
> > > Recently we have made some additions to the module, commissioned by
> > > Lenord, Bauer & Co. GmbH, that could meet your needs. They introduce
> > > a more flexible way of defining docFactories, so that you can add new
> > > factories without having to recompile the whole module. Other
> > > modules (like the news) can bring their own docFactory, and all you
> > > have to do is edit the registry.xml. Here is an example:
> > >
> > > <docFactories>
> > >   <docFactory enabled="true" type="plain">
> > >     <fileType name="plaintext">
> > >       <extension>.txt</extension>
> > >       <class>net.grcomputing.opencms.search.lucene.PlainDocument</class>
> > >     </fileType>
> > >   </docFactory>
> > >   <docFactory enabled="true" type="news">
> > >     <class>net.grcomputing.opencms.search.lucene.NewsDocument</class>
> > >   </docFactory>
> > > </docFactories>
> > >
> > > To index binary files all you need to add is this:
> > >
> > > <docFactory enabled="true" type="binary">
> > >   <class>net.grcomputing.opencms.search.lucene.BodylessDocument</class>
> > > </docFactory>
> > >
> > > There should be no need for an extension mapping.
> > >
> > > For those interested:
> > > For ContentDefinitions (like news) I introduced the following:
> > > <contentDefinitions>
> > >   <contentDefinition type="news"> <!-- must match docFactory type -->
> > >     <class>com.opencms.modules.homepage.news.NewsContentDefinition</class>
> > >     <initClass>net.grcomputing.opencms.search.lucene.NewsInitialization</initClass>
> > >     <listMethod name="getNewsList">
> > >       <param type="java.lang.Integer">1</param>
> > >       <param type="java.lang.String">-1</param>
> > >     </listMethod>
> > >     <page uri="/news.html?__element=entry">
> > >       <param method="getIntId" name="newsid"/>
> > >     </page>
> > >   </contentDefinition>
> > >
> > > In short:
> > > initClass is optional. For the news, the news classes have to be
> > > loaded to initialize the db pool.
> > > listMethod: a method of the content definition class that returns a
> > > List of elements.
> > > page: the page that can display an entry, here a JSP that has a
> > > template element "entry". It also needs the id of the news item:
> > > getIntId is a method of the content definition class and newsid is
> > > the URL parameter the page needs. A link like
> > > news.html?__element=entry&newsid=xy
> > > will be generated.
> > >
> > > Best regards,
> > > Stephan
> > >
> > >
> > > ----- Original Message -----
> > > From: "Ben Rometsch" <[EMAIL PROTECTED]>
> > > To: <[EMAIL PROTECTED]>
> > > Sent: Wednesday, October 22, 2003 6:15 AM
> > > Subject: [opencms-dev] (no subject)
> > >
> > > > Hi Matt,
> > > >
> > > > I am not having any joy!
> > > > I've updated my registry.xml file, with
> > > > the appropriate section reading:
> > > >
> > > > <luceneSearch>
> > > >   <mergeFactor>100000</mergeFactor>
> > > >   <permCheck>true</permCheck>
> > > >   <indexDir>c:\search</indexDir>
> > > >   <analyzer>org.apache.lucene.analysis.standard.StandardAnalyzer</analyzer>
> > > >   <subsearch>true</subsearch>
> > > >   <project>online</project>
> > > >   <docFactories>
> > > >     <pageDocFactory enabled="true">
> > > >       <class>net.grcomputing.opencms.search.lucene.PageDocument</class>
> > > >     </pageDocFactory>
> > > >     <plainDocFactory enabled="true">
> > > >       <fileType name="plaintext">
> > > >         <extension>.txt</extension>
> > > >         <class>net.grcomputing.opencms.search.lucene.PlainDocument</class>
> > > >       </fileType>
> > > >       <fileType name="taggedtext">
> > > >         <extension>.html</extension>
> > > >         <extension>.htm</extension>
> > > >         <extension>.xml</extension>
> > > >         <!-- This will strip tags before processing -->
> > > >         <class>net.grcomputing.opencms.search.lucene.TaggedPlainDocument</class>
> > > >       </fileType>
> > > >
> > > >       <!-- Index binary documents -->
> > > >       <fileType name="plaindocument">
> > > >         <extension>.doc</extension>
> > > >         <extension>.xls</extension>
> > > >         <extension>.pdf</extension>
> > > >         <class>net.grcomputing.opencms.search.lucene.BodylessDocument</class>
> > > >       </fileType>
> > > >
> > > >     </plainDocFactory>
> > > >     <jspDocFactory enabled="true">
> > > >       <class>net.grcomputing.opencms.search.lucene.JspDocument</class>
> > > >     </jspDocFactory>
> > > >     <xmlTemplateDocFactory enabled="false"/>
> > > >   </docFactories>
> > > >   <directories>
> > > >     <directory location="/release/">
> > > >       <section>Test</section>
> > > >       <subsearch>true</subsearch>
> > > >     </directory>
> > > >     <directory location="/RGLIntranet/">
> > > >       <section>Test2</section>
> > > >       <subsearch>true</subsearch>
> > > >     </directory>
> > > >   </directories>
> > > > </luceneSearch>
> > > >
> > > > Notice the section beginning after the remark "Index binary
> > > > documents".
> > > >
> > > > But I cannot get any hits when searching for document names that
> > > > are in the VFS. The other (HTML) searches are working ok. Is the
> > > > "name" property of the fileType tag important? I wasn't sure what
> > > > to add here... I'm not quite sure how to move forward. Maybe it
> > > > would be an idea to add some debugging trace to the
> > > > BodylessDocument class to see what is going on inside it? I want
> > > > to make sure my XML is correct first though!
> > > >
> > > > Thanks for the help,
> > > > Ben
> > > >
> > > > On Thu, 2003-10-16 at 22:46, Ben Rometsch wrote:
> > > > > Hi Matt,
> > > > >
> > > > > Thanks for the reply. If I just want to get the document title to
> > > > > be included in the Lucene index, looking at the code in the
> > > > > net.grcomputing.opencms.search.BodylessDocument class it appears
> > > > > to ignore what the CMSObject is, and attempt to index it
> > > > > regardless. Is this correct?
> > > >
> > > > Correct. It will already index the title, but it will not attempt
> > > > to index the body.
> > > >
> > > > > If this is the case, is it simply a matter of instructing Lucene
> > > > > to index objects other than HTML files in the VFS (i.e.
> > > > > Documents)? Or would I have to create another class, something
> > > > > like net.grcomputing.opencms.search.FileDocument and add a new
> > > > > hook into that class via the registry.xml fragment? Or does the
> > > > > BodyLess document provide this functionality, and it's just a
> > > > > matter of adding a new XML fragment to the registry.xml area?
> > > >
> > > > Again, you are right -- simply adding the appropriate configuration
> > > > to the registry.xml file will suffice. I believe that you will just
> > > > need to extend the plainDocument tag set to include extensions and
> > > > processors... I _think_ that binary files get handled by the plain
> > > > handler.
> > > >
> > > > Matt
> > > >
> > > > _______________________________________________
> > > > This mail is sent to you from the opencms-dev mailing list
> > > > To change your list options, or to unsubscribe from the list,
> > > > please visit http://mail.opencms.org/mailman/listinfo/opencms-dev
> > >
> > > Stephan Hartmann
> > > Unternehmensberatung Währisch & Feykes GmbH
> > > Gustav-Adolf-Str. 5
> > > 47057 Duisburg
> > >
> > > Tel.: 0203-373070
> > > Fax: 0203-376766
> > > E-Mail: [EMAIL PROTECTED]
> > > Internet: www.wfnetz.de
> > >
> > > E-mails sent over the Internet can be created under someone else's
> > > name or manipulated. For this reason, the messages we send by
> > > e-mail never contain legally binding declarations of intent.