Re: Index pdf files with your content in lucene.

Ernesto De Santis Tue, 11 Nov 2003 20:33:04 -0800

try again zipping the files.

after i post the files in the web site.


> Could you also tell us a bit about this code?  Is it better than
> existing PDF/Word parsing solutions?  Pure Java?  Uses POI?

This code use existing parsing solution.
The intent is make a lucene Document for index pdf and word files, with
content.
Is pure java.
Use TextExtraction library.
tm-extractors-0.2.jar
Use POI and PDFBox.

Ernesto
Sorry for my bad English.

>
> Thanks,
> Otis
>
>
> --- Ernesto De Santis <[EMAIL PROTECTED]> wrote:
> > Classes for index Pdf and word files in lucene.
> > Ernesto.
> >
> > ----- Original Message -----
> > From: "Ernesto De Santis" <[EMAIL PROTECTED]>
> > To: <[EMAIL PROTECTED]>
> > Sent: Wednesday, October 29, 2003 12:04 PM
> > Subject: Re: [opencms-dev] Index pdf files with your content in
> > lucene.
> >
> >
> > Hello all,
> >
> > Thans very much Stephan for your valuable help.
> > Attached you will find the PDFDocument, and WordDocument class source
> > code
> >
> > Ernesto.
> >
> >
> > ----- Original Message -----
> > From: "Hartmann, Waehrisch & Feykes GmbH"
> > <[EMAIL PROTECTED]>
> > To: <[EMAIL PROTECTED]>
> > Sent: Tuesday, October 28, 2003 11:10 AM
> > Subject: Re: [opencms-dev] Index pdf files with your content in
> > lucene.
> >
> >
> > > Hi Ernesto,
> > >
> > > the IndexManager retrieves a list of files of a folder by calling
> > the
> > method
> > > getFilesInFolder of CmsObject. This method returns only empty
> > files, i.e.
> > > with empty content. To get the content of a pdf file you have to
> > reread
> > the
> > > file:
> > > f = cms.readFile(f.getAbsolutePath());
> > >
> > > Bye,
> > > Stephan
> > >
> > > Am Montag, 27. Oktober 2003 19:18 schrieben Sie:
> > >
> > > > > Hello
> > > >
> > > > Thanks for the previous reply.
> > > >
> > > > Now, i use
> > > > - version 1.4 of lucene searche module. (the version attached in
> > this
> > list)
> > > > - new version of registry.xml format for module. (like you write
> > me)
> > > > - the pdf files are stored with the binary type.
> > > >
> > > > But i have the next problem:
> > > > i can´t make a InputStream for the cmsfile content.
> > > > For this i write this code in de Document method of my class
> > PDFDocument:
> > > >
> > > > -----------------
> > > >
> > > > InputStream in = new ByteArrayInputStream(f.getContents()); //f
> > is the
> > > > parameter CmsFile of the Document method
> > > >
> > > > PDFExtractor extractor = new PDFExtractor(); //PDFExtractor is
> > lib i
> > use.
> > > > in file system work fine.
> > > >
> > > >
> > > > bodyText = extractor.extractText(in);
> > > >
> > > > ----------------
> > > >
> > > > Is correct use ByteArrayInputStream for make a InputStream for a
> > CmsFile?
> > > >
> > > > The error ocurr in the third line.
> > > > In the PDFParcer.
> > > > the error menssage in tomcat is:
> > > >
> > > > java.io.IOException: Error: Header is corrupt ''
> > > > at PDFParcer.parse
> > > > at PDFExtractor.extractText
> > > > at PDFDocument.Document (my class)
> > > > at.....
> > > >
> > > > By, and thanks.
> > > > Ernesto.
> > > >
> > > >
> > > > ----- Original Message -----
> > > >   From: Hartmann, Waehrisch & Feykes GmbH
> > > >   To: [EMAIL PROTECTED]
> > > >   Sent: Friday, October 24, 2003 4:45 AM
> > > >   Subject: Re: [opencms-dev] Index pdf files with your content in
> > lucene.
> > > >
> > > >
> > > >   Hello Ernesto,
> > > >
> > > >   i assume you are using the unpatched version 1.3 of the search
> > module.
> > > >   As i mentioned yesterday, the plainDocFactory does only index
> > cmsFiles
> > of
> > > > type "plain" but not of type "binary". PDF files are stored as
> > binary. I
> > > > suggest to use the version i posted yesterday. Then your
> > registry.xml
> > would
> > > > have to look like this: ...
> > > >   <docFactories>
> > > >   ...
> > > >      <docFactory type="plain" enabled="true">
> > > >   ...
> > > >      </docFactory>
> > > >      <docFactory type="binary" enabled="true">
> > > >         <fileType name="pdftext">
> > > >            <extension>.pdf</extension>
> > > >
> > <class>net.grcomputing.opencms.search.lucene.PDFDocument</class>
> > > >         </fileType>
> > > >      </docFactory>
> > > >   ...
> > > >   </docFactories>
> > > >
> > > >   Important: The type attribute must match the file types of
> > OpenCms
> > (also
> > > > defined in the registry.xml).
> > > >
> > > >   Bye,
> > > >   Stephan
> > > >
> > > >     ----- Original Message -----
> > > >     From: Ernesto De Santis
> > > >     To: Lucene Users List
> > > >     Cc: [EMAIL PROTECTED]
> > > >     Sent: Thursday, October 23, 2003 4:16 PM
> > > >     Subject: [opencms-dev] Index pdf files with your content in
> > lucene.
> > > >
> > > >
> > > >     Hello
> > > >
> > > >     I am new in opencms and lucene tecnology.
> > > >
> > > >     I won index pdf files, and index de content of this files.
> > > >
> > > >     I work in this way:
> > > >
> > > >     Make a PDFDocument class like JspDocument class.
> > > >     use org.textmining.text.extraction.PDFExtractor class, this
> > class
> > work
> > > > fine out of vfs.
> > > >
> > > >     and write my registry.xml for pdf document, in
> > plainDocFactory tag.
> > > >
> > > >                         <fileType name="pdftext">
> > > >                             <extension>.pdf</extension>
> > > >                             <!-- This will strip tags before
> > processing -->
> > > >
> > > > <class>net.grcomputing.opencms.search.lucene.PDFDocument</class>
> > > > </fileType>
> > > >
> > > >     my PDFDocument content this code:
> > > >     I think that the probrem is how take the content from
> > CmsFile?, what
> > > > InputStream use? PDFExtractor work with extractText(InputStream)
> > method.
> > > >
> > > >     public class PDFDocument implements I_DocumentConstants,
> > > > I_DocumentFactory {
> > > >
> > > >     public PDFDocument(){
> > > >
> > > >     }
> > > >
> > > >
> > > >     public Document Document(CmsObject cmsobject, CmsFile
> > cmsfile)
> > > >
> > > >     throws CmsException
> > > >
> > > >     {
> > > >
> > > >     return Document(cmsobject, cmsfile, null);
> > > >
> > > >     }
> > > >
> > > >     public Document Document(CmsObject cmsobject, CmsFile
> > cmsfile,
> > HashMap
> > > > hashmap)
> > > >
> > > >     throws CmsException
> > > >
> > > >     {
> > > >
> > > >     Document document=(new
> > BodylessDocument()).Document(cmsobject,
> > > > cmsfile);
> > > >
> > > >
> > > >     //put de content in the pdf file.
> > > >
> > > >     String contenido = new String(cmsfile.getContents());
> > > >
> > > >     StringBufferInputStream in = new
> > StringBufferInputStream(contenido);
> > > >
> > > >     // ByteArrayInputStream in = new
> > > > ByteArrayInputStream(contenido.getBytes());
> > > >
> > > >
> > > >     /* try{
> > > >
> > > >     FileInputStream in = new FileInputStream (cmsfile.getPath() +
> > > > cmsfile.getName());
> > > >
> > > >     */
> > > >
> > > >     PDFExtractor extractor = new PDFExtractor();
> > > >
> > > >     String body = extractor.extractText(in);
> > > >
> > > >
> > > >     document.add(Field.Text("body", body));
> > > >
> > > >     /* }catch(FileNotFoundException e){
> > > >
> > > >     e.toString();
> > > >
> > > >     throw new CmsException();
> > > >
> > > >     }
> > > >
> > > >
> > > >     */
> > > >
> > > >     return (document);
> > > >
> > > >     }
> > > >
> > > >
> > > >     thanks
> > > >     Ernesto
> > > >     PD: Sorry for my poor english.
> > > >
> > > >
> > > >
> > > >
> > > >     ----- Original Message -----
> > > >     From: "Hartmann, Waehrisch & Feykes GmbH"
> > > > <[EMAIL PROTECTED]> To: <[EMAIL PROTECTED]>
> > > >     Sent: Wednesday, October 22, 2003 3:50 AM
> > > >     Subject: Re: [opencms-dev] (no subject)
> > > >
> > > >     > Hi Ben,
> > > >     >
> > > >     > i think this won't work since the plainDocFactory will only
> > be
> > used
> > > >     > for files of type "plain" but not for files of type
> > "binary".
> > > >     > Recently we have done some additions to the module - by
> > order of
> > > >     > Lenord, Bauer & Co. GmbH - that could meet your needs. It
> > introduces
> > > >     > a more flexible way of defining docFactories that you can
> > add new
> > > >     > factories without having to recompile the whole module. So
> > other
> > > >     > modules (like the news) can bring their own docFactory and
> > all you
> > > >     > have to do is to edit the registry.xml. Here is an example:
> > > >     >
> > > >     >             <docFactories>
> > > >     >                 <docFactory enabled="true" type="plain">
> > > >     >                     <fileType name="plaintext">
> > > >     >                         <extension>.txt</extension>
> > > >     >
> > > >     >
> > <class>net.grcomputing.opencms.search.lucene.PlainDocument</class>
> > > >     >                     </fileType>
> > > >     >                 </docFactory>
> > > >     >                 <docFactory enabled="true" type="news">
> > > >     >
> > > >     >
> > <class>net.grcomputing.opencms.search.lucene.NewsDocument</class>
> > > >     >                 </docFactory>
> > > >     >             </docFactories>
> > > >     >
> > > >     > To index binary files all you need to add is this:
> > > >     >
> > > >     >            <docFactory enabled="true" type="binary">
> > > >     >
> > > >     >
> > <class>net.grcomputing.opencms.search.lucene.BodylessDocument</class>
> > > >     >            </docFactory>
> > > >     >
> > > >     > There should be no need for an extension mapping.
> > > >     >
> > > >     > For the interested people:
> > > >     > For ContentDefinitions (like news) i introduced the
> > following:
> > > >     >             <contentDefinitions>
> > > >     >                 <contentDefinition type="news"> <!-- must
> > match
> > > >     > docFactory type -->
> > > >     >
> > > >     >
> > <class>com.opencms.modules.homepage.news.NewsContentDefinition</class
> > > >     >>
> > > >     >
> > > >     >
> > <initClass>net.grcomputing.opencms.search.lucene.NewsInitialization</
> > > >     >initCla ss>
> > > >     >                     <listMethod name="getNewsList">
> > > >     >                         <param
> > type="java.lang.Integer">1</param>
> > > >     >                         <param
> > type="java.lang.String">-1</param>
> > > >     >                     </listMethod>
> > > >     >                     <page uri="/news.html?__element=entry">
> > > >     >                         <param method="getIntId"
> > name="newsid"/>
> > > >     >                     </page>
> > > >     >                 </contentDefinition>
> > > >     >
> > > >     > In short:
> > > >     > initClass is optional: For the news the news classes have
> > to be
> > > >     > loaded to initialize the db pool.
> > > >     > listMethod: a method of the content definition class that
> > returns
> > a
> > > >     > List of elements
> > > >     > page: the page that can display an entry. Here a jsp that
> > has a
> > > >     > template element "entry". It also needs the id of the news
> > item.
> > > >     > getIntId is a method of the content definition class and
> > newsid is
> > > >     > the url parameter the page needs. A link like
> > > >     > news.html?__element=entry&newsid=xy
> > > >     > will be generated.
> > > >     >
> > > >     > Best regards,
> > > >     > Stephan
> > > >     >
> > > >     >
> > > >     > ----- Original Message -----
> > > >     > From: "Ben Rometsch" <[EMAIL PROTECTED]>
> > > >     > To: <[EMAIL PROTECTED]>
> > > >     > Sent: Wednesday, October 22, 2003 6:15 AM
> > > >     > Subject: [opencms-dev] (no subject)
> > > >     >
> > > >     > > Hi Matt,
> > > >     > >
> > > >     > > I am not having any joy! I've updated my registry.xml
> > file, with
> > > >     > > the appropriate section reading:
> > > >     > >
> > > >     > > <luceneSearch>
> > > >     > > <mergeFactor>100000</mergeFactor>
> > > >     > > <permCheck>true</permCheck>
> > > >     > > <indexDir>c:\search</indexDir>
> > > >     > >
> > > >     > >
> > <analyzer>org.apache.lucene.analysis.standard.StandardAnalyzer</ana
> > > >     > >lyzer> <subsearch>true</subsearch>
> > > >     > > <project>online</project>
> > > >     > > <docFactories>
> > > >     > > <pageDocFactory enabled="true">
> > > >     > >
> > > >     > >
> > <class>net.grcomputing.opencms.search.lucene.PageDocument</class>
> > > >     > > </pageDocFactory>
> > > >     > > <plainDocFactory enabled="true">
> > > >     > > <fileType name="plaintext">
> > > >     > > <extension>.txt</extension>
> > > >     > >
> > > >     > >
> > <class>net.grcomputing.opencms.search.lucene.PlainDocument</class>
> > > >     > > </fileType>
> > > >     > > <fileType name="taggedtext">
> > > >     > > <extension>.html</extension>
> > > >     > > <extension>.htm</extension>
> > > >     > > <extension>.xml</extension>
> > > >     > > <!-- This will strip tags before processing
> > > >     > > -->
> > > >     > >
> > > >     > >
> > <class>net.grcomputing.opencms.search.lucene.TaggedPlainDocument</c
> > > >     > >lass> </fileType>
> > > >     > >
> > > >     > > <!-- Index binary documents -->
> > > >     > > <fileType name="plaindocument">
> > > >     > > <extension>.doc</extension>
> > > >     > > <extension>.xls</extension>
> > > >     > > <extension>.pdf</extension>
> > > >     > >
> > > >     > >
> > <class>net.grcomputing.opencms.search.lucene.BodylessDocument</clas
> > > >     > >s> </fileType>
> > > >     > >
> > > >     > > </plainDocFactory>
> > > >     > > <jspDocFactory enabled="true">
> > > >     > >
> > > >     > >
> > <class>net.grcomputing.opencms.search.lucene.JspDocument</class>
> > > >     > > </jspDocFactory>
> > > >     > > <xmlTemplateDocFactory enabled="false"/>
> > > >     > > </docFactories>
> > > >     > > <directories>
> > > >     > > <directory location="/release/">
> > > >     > > <section>Test</section>
> > > >     > > <subsearch>true</subsearch>
> > > >     > > </directory>
> > > >     > > <directory location="/RGLIntranet/">
> > > >     > > <section>Test2</section>
> > > >     > > <subsearch>true</subsearch>
> > > >     > > </directory>
> > > >     > > </directories>
> > > >     > > </luceneSearch>
> > > >     > >
> > > >     > > Notice the section beginning after the remark "Index
> > binary
> > > >     > > documents".
> > > >     > >
> > > >     > > But I cannot get any hits when searching for document
> > names that
> > > >     > > are in
> > > >     >
> > > >     > the
> > > >     >
> > > >     > > VFS. The other (HTML) searches are working ok. Is the
> > "name"
> > > >     > > property of
> > > >     >
> > > >     > the
> > > >     >
> > > >     > > fileType tag important? I wasn't sure what to add
> > here...I'm not
> > > >     > > quite
> > > >     >
> > > >     > sure
> > > >     >
> > > >     > > how to move forward. Maybe it would be an idea to add
> > some
> > > >     > > debugging trace to the BodylessDocument class to see what
> > is
> > going
> > > >     > > on inside it? I want to make sure my XML is correct first
> > tho!
> > > >     > >
> > > >     > > Thanks for the help,
> > > >     > > Ben
> > > >     > >
> > > >     > > On Thu, 2003-10-16 at 22:46, Ben Rometsch wrote:
> > > >     > > > Hi Matt,
> > > >     > > >
> > > >     > > > Thanks for the reply. If I just want to get the
> > document title
> > to
> > > >     > > > be included in the Lucene index, looking at the code in
> > the
> > > >     > > > net.grcomputing.opencms.search.BodylessDocument class
> > it
> > appears
> > > >     > > > to
> > > >     >
> > > >     > ignore
> > > >     >
> > > >     > > > what the CMSObject is, and attempt to index it
> > regardless. Is
> > > >     > > > this
> > > >     > >
> > > >     > > correct?
> > > >     > >
> > > >     > >
> > > >     > > Correct. It will already index the title, but it will not
> > attempt
> > > >     > > to index the body.
> > > >     > >
> > > >     > > > If this is the case, is it simply a matter of
> > instructing
> > Lucene
> > > >     > > > to
> > > >     >
> > > >     > index
> > > >     >
> > > >     > > > obects other than HTML files in the VFS  (i.e.
> > Documents) ? Or
> > > >     > > > would I
> > > >     > >
> > > >     > > have
> > > >     > >
> > > >     > > > to create another class, something like
> > > >     > > > net.grcomputing.opencms.search.FileDocument and add a
> > new hook
> > > >     > > > into that class via the registry.xml fragment?  Or does
> > the
> > > >     > > > BodyLess document
> > > >     > >
> > > >     > > provide
> > > >     > >
> > > >     > > > this functionality, and it's just a matter of adding a
> > new XML
> > > >     > > > fragment
> > > >     >
> > > >     > to
> > > >     >
> > > >     > > > the registry.xml are?
> > > >     > >
> > > >     > > Again, you are right -- simply adding the appropriate
> > configuration
> > > >     > > to the registry.xml file will suffice. I believe that you
> > will
> > just
> > > >     > > need to extend the plainDocument tag set to include
> > extensions
> > and
> > > >     > > processors... I _think_ that binary files get handled by
> > the
> > plain
> > > >     > > handler.
> > > >     > >
> > > >     > > Matt
> > > >     > >
> > > >     > > _______________________________________________
> > > >     > > This mail is send to you from the opencms-dev mailing
> > list
> > > >     > > To change your list options, or to unsubscribe from the
> > list,
> > > >     > > please visit
> > http://mail.opencms.org/mailman/listinfo/opencms-dev
> > > >     >
> > > >     > Stephan Hartmann
> > > >     > Unternehmensberatung Währisch & Feykes GmbH
> > > >     > Gustav-Adolf-Str. 5
> > > >     > 47057 Duisburg
> > > >     >
> > > >     > Tel.: 0203-373070
> > > >     > Fax: 0203-376766
> > > >     > E-Mail: [EMAIL PROTECTED]
> > > >     > Internet: www.wfnetz.de
> > > >     >
> > > >     > Über das Internet versandte E-Mails können unter fremden
> > Namen
> > > >     > erstellt oder manipuliert werden. Aus diesem Grund
> > enthalten
> > unsere
> > > >     > mit E-Mail verschickten Nachrichten grundsätzlich keine
> > > >     > rechtsverbindlichen Willenserklärungen.
> > >
> > > ----------------------------------------
> > > Content-Type: text/html; charset="iso-8859-1"; name="Anhang: 1"
> > > Content-Transfer-Encoding: quoted-printable
> > > Content-Description:
> > > ----------------------------------------
> > >
> > > --
> > > Stephan Hartmann
> > >
> > > Währisch & Feykes GmbH
> > > Gustav-Adolf-Str. 5
> > > 47057 Duisburg
> > > Tel. 0203 / 373 070
> > > Fax 0203 / 376 766
> > > [EMAIL PROTECTED]
> > >
> > > ------------------------------------------------------
> > > Ausschlusserklärung (Disclaimer):
> > > Über das Internet versandte E-mails können unter fremden Namen
> > erstellt
> > oder
> > > manipuliert werden. Aus diesem Grund enthalten unsere mit E-mail
> > verschickten
> > > Nachrichten grundsätzlich keine rechtsverbindlichen
> > Willenserklärungen.
> > > _______________________________________________
> > > This mail is send to you from the opencms-dev mailing list
> > > To change your list options, or to unsubscribe from the list,
> > please visit
> > > http://mail.opencms.org/mailman/listinfo/opencms-dev
> >
> >
> > >
> ---------------------------------------------------------------------
> > To unsubscribe, e-mail: [EMAIL PROTECTED]
> > For additional commands, e-mail: [EMAIL PROTECTED]
>
>
> __________________________________
> Do you Yahoo!?
> Protect your identity with Yahoo! Mail AddressGuard
> http://antispam.yahoo.com/whatsnewfree
>
>
>

/* ====================================================================
 * The TextMining.Org Software License, Version 1.1
 *
 * Copyright (c) 2002 Ryan Ackley  All rights
 * reserved.
 *
 * Redistribution and use in source and binary forms, with or without
 * modification, are permitted provided that the following conditions
 * are met:
 *
 * 1. Redistributions of source code must retain the above copyright
 *    notice, this list of conditions and the following disclaimer.
 *
 * 2. Redistributions in binary form must reproduce the above copyright
 *    notice, this list of conditions and the following disclaimer in
 *    the documentation and/or other materials provided with the
 *    distribution.
 *
 * 3. The end-user documentation included with the redistribution,
 *    if any, must include the following acknowledgment:
 *       "This product includes software developed by 
 *        TextMining.Org (http://www.textmining.org/)."
 *    Alternately, this acknowledgment may appear in the software itself,
 *    if and wherever such third-party acknowledgments normally appear.
 *
 * 4. The name "TextMining.Org" must not be used to endorse or promote products
 *    derived from this software without prior written permission. For
 *    written permission, please contact [EMAIL PROTECTED]
 *
 * 5. Products derived from this software may not be called "TextMining.org",
 *    "TextMining.org extractors", nor may "TextMining.org" appear in their name, 
without
 *    prior written permission of the Apache Software Foundation.
 *
 * THIS SOFTWARE IS PROVIDED ``AS IS'' AND ANY EXPRESSED OR IMPLIED
 * WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES
 * OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
 * DISCLAIMED.  IN NO EVENT SHALL THE APACHE SOFTWARE FOUNDATION OR
 * ITS CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
 * SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
 * LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF
 * USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND
 * ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
 * OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT
 * OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
 * SUCH DAMAGE.
 * ====================================================================
 *
 * This software consists of voluntary contributions made by many
 * individuals on behalf of the Apache Software Foundation.  For more
 * information on the Apache Software Foundation, please see
 * <http://www.apache.org/>.
 */

---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]

Re: Index pdf files with your content in lucene.

Reply via email to