Ernesto, it looks like something got stripped. A ZIP file should make
it to the list. If not, maybe you can post it somewhere.
Could you also tell us a bit about this code? Is it better than
existing PDF/Word parsing solutions? Pure Java? Uses POI?
Thanks,
Otis
--- Ernesto De Santis <[EMAIL PROTECTED]> wrote:
> Classes for indexing PDF and Word files in Lucene.
> Ernesto.
>
> ----- Original Message -----
> From: "Ernesto De Santis" <[EMAIL PROTECTED]>
> To: <[EMAIL PROTECTED]>
> Sent: Wednesday, October 29, 2003 12:04 PM
> Subject: Re: [opencms-dev] Index pdf files with your content in
> lucene.
>
>
> Hello all,
>
> Thanks very much, Stephan, for your valuable help.
> Attached you will find the source code of the PDFDocument and
> WordDocument classes.
>
> Ernesto.
>
>
> ----- Original Message -----
> From: "Hartmann, Waehrisch & Feykes GmbH"
> <[EMAIL PROTECTED]>
> To: <[EMAIL PROTECTED]>
> Sent: Tuesday, October 28, 2003 11:10 AM
> Subject: Re: [opencms-dev] Index pdf files with your content in
> lucene.
>
>
> > Hi Ernesto,
> >
> > the IndexManager retrieves the list of files in a folder by calling
> > the method getFilesInFolder of CmsObject. This method returns only
> > empty files, i.e. files with empty content. To get the content of a
> > PDF file you have to reread the file:
> > f = cms.readFile(f.getAbsolutePath());
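An empty header in the "Header is corrupt ''" message is consistent with empty content reaching the parser. A minimal sketch of a guard that could run before extraction, assuming only that a well-formed PDF begins with the ASCII marker "%PDF-". PdfHeaderCheck and hasPdfHeader are hypothetical names, not part of OpenCms, Lucene, or the search module:

```java
import java.nio.charset.StandardCharsets;

/** Hypothetical helper; not part of OpenCms or the Lucene search module. */
final class PdfHeaderCheck {

    // A well-formed PDF begins with the ASCII marker "%PDF-".
    private static final byte[] MAGIC = "%PDF-".getBytes(StandardCharsets.US_ASCII);

    /** Returns true only if the content starts with the PDF marker. */
    static boolean hasPdfHeader(byte[] content) {
        if (content == null || content.length < MAGIC.length) {
            // Empty or truncated content, e.g. a file listed without its body.
            return false;
        }
        for (int i = 0; i < MAGIC.length; i++) {
            if (content[i] != MAGIC[i]) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        byte[] empty = new byte[0];
        byte[] pdf = "%PDF-1.4\n".getBytes(StandardCharsets.US_ASCII);
        System.out.println("empty content: " + hasPdfHeader(empty));
        System.out.println("pdf content:   " + hasPdfHeader(pdf));
    }
}
```

If such a check fails for the bytes returned by getContents(), the file was probably not reread before extraction.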
> >
> > Bye,
> > Stephan
> >
> > On Monday, 27 October 2003 19:18 you wrote:
> >
> > > > Hello
> > >
> > > Thanks for the previous reply.
> > >
> > > Now I use:
> > > - version 1.4 of the Lucene search module (the version attached to
> > >   this list)
> > > - the new registry.xml format for the module (as you wrote to me)
> > > - PDF files stored with the binary type.
> > >
> > > But I have the following problem:
> > > I can't make an InputStream for the CmsFile content.
> > > For this I wrote the following code in the Document method of my
> > > class PDFDocument:
> > >
> > > -----------------
> > >
> > > // f is the CmsFile parameter of the Document method
> > > InputStream in = new ByteArrayInputStream(f.getContents());
> > >
> > > // PDFExtractor is the library I use; on the file system it works fine
> > > PDFExtractor extractor = new PDFExtractor();
> > >
> > > bodyText = extractor.extractText(in);
> > >
> > > ----------------
> > >
> > > Is it correct to use a ByteArrayInputStream to make an InputStream
> > > for a CmsFile?
> > >
> > > The error occurs in the third statement, inside PDFParcer.
> > > The error message in Tomcat is:
> > >
> > > java.io.IOException: Error: Header is corrupt ''
> > > at PDFParcer.parse
> > > at PDFExtractor.extractText
> > > at PDFDocument.Document (my class)
> > > at.....
> > >
> > > Bye, and thanks.
> > > Ernesto.
> > >
> > >
> > > ----- Original Message -----
> > > From: Hartmann, Waehrisch & Feykes GmbH
> > > To: [EMAIL PROTECTED]
> > > Sent: Friday, October 24, 2003 4:45 AM
> > > Subject: Re: [opencms-dev] Index pdf files with your content in
> lucene.
> > >
> > >
> > > Hello Ernesto,
> > >
> > > I assume you are using the unpatched version 1.3 of the search
> > > module. As I mentioned yesterday, the plainDocFactory only indexes
> > > CmsFiles of type "plain", not of type "binary". PDF files are stored
> > > as binary. I suggest using the version I posted yesterday. Then your
> > > registry.xml would have to look like this: ...
> > > <docFactories>
> > > ...
> > > <docFactory type="plain" enabled="true">
> > > ...
> > > </docFactory>
> > > <docFactory type="binary" enabled="true">
> > > <fileType name="pdftext">
> > > <extension>.pdf</extension>
> > > <class>net.grcomputing.opencms.search.lucene.PDFDocument</class>
> > > </fileType>
> > > </docFactory>
> > > ...
> > > </docFactories>
> > >
> > > Important: The type attribute must match the file types of OpenCms
> > > (also defined in the registry.xml).
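The matching rule above can be pictured as a two-step lookup: first by OpenCms resource type, then by file extension. This is only an illustrative sketch of that idea, not the module's actual implementation. DocFactoryRegistry, register, and lookup are hypothetical names:

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

/** Hypothetical model of the docFactory configuration; not the module's real code. */
final class DocFactoryRegistry {

    // resource type -> (file extension -> document class name)
    private final Map<String, Map<String, String>> factories = new HashMap<>();

    /** Records one <fileType> entry under the docFactory with the given type. */
    void register(String type, String extension, String className) {
        factories.computeIfAbsent(type, t -> new HashMap<>())
                 .put(extension.toLowerCase(Locale.ROOT), className);
    }

    /** Returns the configured class name, or null if type or extension does not match. */
    String lookup(String resourceType, String fileName) {
        Map<String, String> byExtension = factories.get(resourceType);
        if (byExtension == null) {
            return null; // no docFactory declared for this resource type
        }
        int dot = fileName.lastIndexOf('.');
        if (dot < 0) {
            return null; // no extension to match against
        }
        return byExtension.get(fileName.substring(dot).toLowerCase(Locale.ROOT));
    }

    public static void main(String[] args) {
        DocFactoryRegistry registry = new DocFactoryRegistry();
        registry.register("binary", ".pdf",
                "net.grcomputing.opencms.search.lucene.PDFDocument");
        // A PDF stored as "binary" is found; the same name under "plain" is not.
        System.out.println(registry.lookup("binary", "manual.pdf"));
        System.out.println(registry.lookup("plain", "manual.pdf"));
    }
}
```

In this model a PDF entry placed under a "plain" factory is never reached for a file stored as "binary", which matches the behaviour described above.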
> > >
> > > Bye,
> > > Stephan
> > >
> > > ----- Original Message -----
> > > From: Ernesto De Santis
> > > To: Lucene Users List
> > > Cc: [EMAIL PROTECTED]
> > > Sent: Thursday, October 23, 2003 4:16 PM
> > > Subject: [opencms-dev] Index pdf files with your content in
> lucene.
> > >
> > >
> > > Hello
> > >
> > > I am new to OpenCms and Lucene.
> > >
> > > I want to index PDF files, including the content of these files.
> > >
> > > I am working this way:
> > >
> > > I made a PDFDocument class modeled on the JspDocument class. It uses
> > > the org.textmining.text.extraction.PDFExtractor class, which works
> > > fine outside the VFS.
> > >
> > > I also wrote my registry.xml entry for PDF documents, inside the
> > > plainDocFactory tag:
> > >
> > > <fileType name="pdftext">
> > > <extension>.pdf</extension>
> > > <!-- This will strip tags before processing -->
> > > <class>net.grcomputing.opencms.search.lucene.PDFDocument</class>
> > > </fileType>
> > >
> > > My PDFDocument contains the code below. I think the problem is how
> > > to get the content from the CmsFile: which InputStream should I use?
> > > PDFExtractor works with an extractText(InputStream) method.
> > >
> > > public class PDFDocument implements I_DocumentConstants,
> > >         I_DocumentFactory {
> > >
> > >     public PDFDocument() {
> > >     }
> > >
> > >     public Document Document(CmsObject cmsobject, CmsFile cmsfile)
> > >             throws CmsException {
> > >         return Document(cmsobject, cmsfile, null);
> > >     }
> > >
> > >     public Document Document(CmsObject cmsobject, CmsFile cmsfile,
> > >             HashMap hashmap) throws CmsException {
> > >         Document document =
> > >                 (new BodylessDocument()).Document(cmsobject, cmsfile);
> > >
> > >         // Put the content of the PDF file into the document.
> > >         String contenido = new String(cmsfile.getContents());
> > >         StringBufferInputStream in =
> > >                 new StringBufferInputStream(contenido);
> > >         // ByteArrayInputStream in =
> > >         //         new ByteArrayInputStream(contenido.getBytes());
> > >
> > >         /* try {
> > >             FileInputStream in = new FileInputStream(
> > >                     cmsfile.getPath() + cmsfile.getName());
> > >         */
> > >
> > >         PDFExtractor extractor = new PDFExtractor();
> > >         String body = extractor.extractText(in);
> > >
> > >         document.add(Field.Text("body", body));
> > >
> > >         /* } catch (FileNotFoundException e) {
> > >             e.toString();
> > >             throw new CmsException();
> > >         }
> > >         */
> > >
> > >         return (document);
> > >     }
> > > }
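A side note on the String conversion in the code above: PDF content is binary, and decoding it to a String and back is generally not lossless. A minimal sketch of the difference, using only the JDK. BinaryStreamDemo and its methods are hypothetical names, and UTF-8 is chosen here only to make the result deterministic (the quoted code uses the platform default charset):

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

/** Hypothetical demo class; shows why binary content should stay as bytes. */
final class BinaryStreamDemo {

    /** "%PDF" followed by four bytes that are not valid UTF-8. */
    static byte[] sample() {
        return new byte[] {0x25, 0x50, 0x44, 0x46,
                (byte) 0xE2, (byte) 0xE3, (byte) 0xCF, (byte) 0xD3};
    }

    /** Decodes the bytes to a String and encodes them again. */
    static byte[] viaString(byte[] raw) {
        return new String(raw, StandardCharsets.UTF_8).getBytes(StandardCharsets.UTF_8);
    }

    /** Reads the bytes back through a ByteArrayInputStream without any decoding. */
    static byte[] viaByteStream(byte[] raw) {
        ByteArrayInputStream in = new ByteArrayInputStream(raw);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[1024];
        int n;
        while ((n = in.read(buffer, 0, buffer.length)) != -1) {
            out.write(buffer, 0, n);
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        byte[] raw = sample();
        System.out.println("via String preserved:      "
                + Arrays.equals(raw, viaString(raw)));
        System.out.println("via byte stream preserved: "
                + Arrays.equals(raw, viaByteStream(raw)));
    }
}
```

Passing the byte array straight to a ByteArrayInputStream avoids the charset round trip entirely.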
> > >
> > >
> > > thanks
> > > Ernesto
> > > P.S.: Sorry for my poor English.
> > >
> > >
> > >
> > >
> > > ----- Original Message -----
> > > From: "Hartmann, Waehrisch & Feykes GmbH"
> > > <[EMAIL PROTECTED]> To: <[EMAIL PROTECTED]>
> > > Sent: Wednesday, October 22, 2003 3:50 AM
> > > Subject: Re: [opencms-dev] (no subject)
> > >
> > > > Hi Ben,
> > > >
> > > > I think this won't work, since the plainDocFactory is only used
> > > > for files of type "plain", not for files of type "binary".
> > > > Recently we made some additions to the module - by order of
> > > > Lenord, Bauer & Co. GmbH - that could meet your needs. They
> > > > introduce a more flexible way of defining docFactories, so that
> > > > you can add new factories without having to recompile the whole
> > > > module. Other modules (like the news) can bring their own
> > > > docFactory, and all you have to do is edit the registry.xml.
> > > > Here is an example:
> > > >
> > > > <docFactories>
> > > >   <docFactory enabled="true" type="plain">
> > > >     <fileType name="plaintext">
> > > >       <extension>.txt</extension>
> > > >       <class>net.grcomputing.opencms.search.lucene.PlainDocument</class>
> > > >     </fileType>
> > > >   </docFactory>
> > > >   <docFactory enabled="true" type="news">
> > > >     <class>net.grcomputing.opencms.search.lucene.NewsDocument</class>
> > > >   </docFactory>
> > > > </docFactories>
> > > >
> > > > To index binary files all you need to add is this:
> > > >
> > > > <docFactory enabled="true" type="binary">
> > > >   <class>net.grcomputing.opencms.search.lucene.BodylessDocument</class>
> > > > </docFactory>
> > > >
> > > > There should be no need for an extension mapping.
> > > >
> > > > For the interested people:
> > > > For ContentDefinitions (like news) I introduced the following:
> > > > <contentDefinitions>
> > > >   <contentDefinition type="news"> <!-- must match docFactory type -->
> > > >     <class>com.opencms.modules.homepage.news.NewsContentDefinition</class>
> > > >     <initClass>net.grcomputing.opencms.search.lucene.NewsInitialization</initClass>
> > > >     <listMethod name="getNewsList">
> > > >       <param type="java.lang.Integer">1</param>
> > > >       <param type="java.lang.String">-1</param>
> > > >     </listMethod>
> > > >     <page uri="/news.html?__element=entry">
> > > >       <param method="getIntId" name="newsid"/>
> > > >     </page>
> > > >   </contentDefinition>
> > > >
> > > > In short:
> > > > initClass is optional. For the news, the news classes have to be
> > > > loaded to initialize the db pool.
> > > > listMethod: a method of the content definition class that returns
> > > > a List of elements.
> > > > page: the page that can display an entry, here a JSP that has a
> > > > template element "entry". It also needs the id of the news item.
> > > > getIntId is a method of the content definition class and newsid is
> > > > the URL parameter the page needs. A link like
> > > > news.html?__element=entry&newsid=xy
> > > > will be generated.
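The link generation described above amounts to appending one more query parameter to the configured page uri. A rough sketch of that step, under the assumption that the id returned by the configured method is simply appended as a URL parameter. EntryLinkBuilder and buildLink are hypothetical names and not taken from the module:

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

/** Hypothetical helper illustrating the generated link; not the module's real code. */
final class EntryLinkBuilder {

    /**
     * Appends paramName=id to pageUri, using '&' if the uri already
     * carries a query string and '?' otherwise.
     */
    static String buildLink(String pageUri, String paramName, String id) {
        String separator = pageUri.indexOf('?') >= 0 ? "&" : "?";
        try {
            return pageUri + separator + paramName + "=" + URLEncoder.encode(id, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is a charset every Java platform must support.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Page uri and parameter name as in the contentDefinition example; the id is made up.
        System.out.println(buildLink("/news.html?__element=entry", "newsid", "42"));
    }
}
```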
> > > >
> > > > Best regards,
> > > > Stephan
> > > >
> > > >
> > > > ----- Original Message -----
> > > > From: "Ben Rometsch" <[EMAIL PROTECTED]>
> > > > To: <[EMAIL PROTECTED]>
> > > > Sent: Wednesday, October 22, 2003 6:15 AM
> > > > Subject: [opencms-dev] (no subject)
> > > >
> > > > > Hi Matt,
> > > > >
> > > > > I am not having any joy! I've updated my registry.xml file, with
> > > > > the appropriate section reading:
> > > > >
> > > > > <luceneSearch>
> > > > >   <mergeFactor>100000</mergeFactor>
> > > > >   <permCheck>true</permCheck>
> > > > >   <indexDir>c:\search</indexDir>
> > > > >   <analyzer>org.apache.lucene.analysis.standard.StandardAnalyzer</analyzer>
> > > > >   <subsearch>true</subsearch>
> > > > >   <project>online</project>
> > > > >   <docFactories>
> > > > >     <pageDocFactory enabled="true">
> > > > >       <class>net.grcomputing.opencms.search.lucene.PageDocument</class>
> > > > >     </pageDocFactory>
> > > > >     <plainDocFactory enabled="true">
> > > > >       <fileType name="plaintext">
> > > > >         <extension>.txt</extension>
> > > > >         <class>net.grcomputing.opencms.search.lucene.PlainDocument</class>
> > > > >       </fileType>
> > > > >       <fileType name="taggedtext">
> > > > >         <extension>.html</extension>
> > > > >         <extension>.htm</extension>
> > > > >         <extension>.xml</extension>
> > > > >         <!-- This will strip tags before processing -->
> > > > >         <class>net.grcomputing.opencms.search.lucene.TaggedPlainDocument</class>
> > > > >       </fileType>
> > > > >
> > > > >       <!-- Index binary documents -->
> > > > >       <fileType name="plaindocument">
> > > > >         <extension>.doc</extension>
> > > > >         <extension>.xls</extension>
> > > > >         <extension>.pdf</extension>
> > > > >         <class>net.grcomputing.opencms.search.lucene.BodylessDocument</class>
> > > > >       </fileType>
> > > > >
> > > > >     </plainDocFactory>
> > > > >     <jspDocFactory enabled="true">
> > > > >       <class>net.grcomputing.opencms.search.lucene.JspDocument</class>
> > > > >     </jspDocFactory>
> > > > >     <xmlTemplateDocFactory enabled="false"/>
> > > > >   </docFactories>
> > > > >   <directories>
> > > > >     <directory location="/release/">
> > > > >       <section>Test</section>
> > > > >       <subsearch>true</subsearch>
> > > > >     </directory>
> > > > >     <directory location="/RGLIntranet/">
> > > > >       <section>Test2</section>
> > > > >       <subsearch>true</subsearch>
> > > > >     </directory>
> > > > >   </directories>
> > > > > </luceneSearch>
> > > > >
> > > > > Notice the section beginning after the remark "Index binary
> > > > > documents".
> > > > >
> > > > > But I cannot get any hits when searching for document names that
> > > > > are in the VFS. The other (HTML) searches are working ok. Is the
> > > > > "name" property of the fileType tag important? I wasn't sure what
> > > > > to add here... I'm not quite sure how to move forward. Maybe it
> > > > > would be an idea to add some debugging trace to the
> > > > > BodylessDocument class to see what is going on inside it? I want
> > > > > to make sure my XML is correct first tho!
> > > > >
> > > > > Thanks for the help,
> > > > > Ben
> > > > >
> > > > > On Thu, 2003-10-16 at 22:46, Ben Rometsch wrote:
> > > > > > Hi Matt,
> > > > > >
> > > > > > Thanks for the reply. If I just want to get the document title
> > > > > > to be included in the Lucene index, looking at the code in the
> > > > > > net.grcomputing.opencms.search.BodylessDocument class it
> > > > > > appears to ignore what the CMSObject is, and attempt to index
> > > > > > it regardless. Is this correct?
> > > > >
> > > > > Correct. It will already index the title, but it will not
> > > > > attempt to index the body.
> > > > >
> > > > > > If this is the case, is it simply a matter of instructing
> > > > > > Lucene to index objects other than HTML files in the VFS (i.e.
> > > > > > Documents)? Or would I have to create another class, something
> > > > > > like net.grcomputing.opencms.search.FileDocument, and add a new
> > > > > > hook into that class via the registry.xml fragment? Or does the
> > > > > > BodylessDocument provide this functionality, and it's just a
> > > > > > matter of adding a new XML fragment to the registry.xml area?
> > > > >
> > > > > Again, you are right -- simply adding the appropriate
> > > > > configuration to the registry.xml file will suffice. I believe
> > > > > that you will just need to extend the plainDocument tag set to
> > > > > include extensions and processors... I _think_ that binary files
> > > > > get handled by the plain handler.
> > > > >
> > > > > Matt
> > > > >
> > > > > _______________________________________________
> > > > > This mail is sent to you from the opencms-dev mailing list
> > > > > To change your list options, or to unsubscribe from the list,
> > > > > please visit http://mail.opencms.org/mailman/listinfo/opencms-dev
> > > >
> > > > Stephan Hartmann
> > > > Unternehmensberatung Währisch & Feykes GmbH
> > > > Gustav-Adolf-Str. 5
> > > > 47057 Duisburg
> > > >
> > > > Tel.: 0203-373070
> > > > Fax: 0203-376766
> > > > E-Mail: [EMAIL PROTECTED]
> > > > Internet: www.wfnetz.de
> > > >
> > > > E-mails sent over the Internet can be created under a false name
> > > > or manipulated. For this reason, the messages we send by e-mail
> > > > do not, as a matter of principle, contain any legally binding
> > > > declarations of intent.
> >
>
>
> >
---------------------------------------------------------------------
> To unsubscribe, e-mail: [EMAIL PROTECTED]
> For additional commands, e-mail: [EMAIL PROTECTED]