Check the Lucene users list. There are many threads about how to index PDF documents with Lucene.


Bernhard

PROYECTA.Fernandez Garcia, Ivan wrote:

Good morning everybody,

        Has anyone indexed PDF files?
        If so, could you tell us how you did it?

Thank you for your attention.

-----Original Message-----
From: Xiaozheng Ma [mailto:[EMAIL PROTECTED]
Sent: Wednesday, November 17, 2004 16:52
To: Lucene Developers List
Subject: RE: Queries Lucene 1.3


You are right; of course the most important thing is to extract the text from the file and index it. Something like:

        document.add(Field.Text("contents", ifile.getTextContents()));


I do a search using:

        public static Hits search(String queryString, String indexFilePath)
                        throws Exception {
                IndexSearcher searcher = new IndexSearcher(indexFilePath);
                Query query = QueryParser.parse(queryString, "contents",
                                new StandardAnalyzer());
                return searcher.search(query);
        }

One comment: if you append "*" to the end of the search pattern, you will have
problems with some advanced searches. For example:
1. Phrase search: if you search on "a pretty cat", you may get an exception for "a pretty cat"*
2. Searching on a group for a field, for example Body:(pet AND "pretty"): you are
actually searching on Body:(pet AND "pretty")* and the parser will give you an error.
3. In general, any search sentence that uses ) " etc.
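
One way to avoid the cases above is to append "*" only when the input is a single plain term. The following is a minimal sketch in plain Java; the class WildcardSuffix, its method names, and the list of special characters are assumptions made for illustration and are not part of the Lucene API.

```java
// Hypothetical helper, not part of Lucene: decides whether a trailing "*"
// can be appended to a user query without hitting the cases listed above.
final class WildcardSuffix {

    // Characters that mark phrases, groups, field prefixes or other operators.
    // This is an illustrative subset chosen for the sketch, not the full syntax.
    private static final String SPECIAL = "\"()[]{}:+-!^~*?\\";

    private WildcardSuffix() {}

    /** Returns true when the query is one term with no whitespace or special characters. */
    static boolean isPlainTerm(String query) {
        String q = query.trim();
        if (q.isEmpty()) {
            return false;
        }
        for (int i = 0; i < q.length(); i++) {
            char c = q.charAt(i);
            if (Character.isWhitespace(c) || SPECIAL.indexOf(c) >= 0) {
                return false;
            }
        }
        return true;
    }

    /** Appends "*" only to plain single terms; any other query is returned trimmed but otherwise unchanged. */
    static String withTrailingWildcard(String query) {
        String q = query.trim();
        return isPlainTerm(q) ? q + "*" : q;
    }

    public static void main(String[] args) {
        System.out.println(withTrailingWildcard("cat"));
        System.out.println(withTrailingWildcard("\"a pretty cat\""));
        System.out.println(withTrailingWildcard("Body:(pet AND \"pretty\")"));
    }
}
```

The result of withTrailingWildcard(...) would then be passed to QueryParser.parse in place of the raw string plus "*". The check is deliberately conservative: a hyphenated word is also left without a wildcard.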


If you index them page by page, you should not get an OOME.

Hope this helps.

--
Xiaozheng

-----Original Message-----
From: PROYECTA.Fernandez Garcia, Ivan [mailto:[EMAIL PROTECTED]

Sent: Wednesday, November 17, 2004 10:32 AM
To: Lucene Developers List
Subject: RE: Queries Lucene 1.3

First of all, Xiaozheng, thanks for your attention.
I have tested it but we get no results.

Let me explain in detail:

        We would like to search for text in a PDF file.
        I think we must index the content of each page to search the text, mustn't we?
        So we must use the statement document.add(Field.Text()), mustn't we?
        We search for text using the following statements:

                Query q = QueryParser.parse(m_texto + "*",
                                CValoresGlobales.M_CONTENIDO_PAGINA, analizador);
                q = q.rewrite(indexReader);
                hits = searcher.search(q);

        Is this O.K.?

Thanks for your help.

-----Original Message-----
From: Xiaozheng Ma [mailto:[EMAIL PROTECTED]
Sent: Wednesday, November 17, 2004 16:21
To: Lucene Developers List
Subject: RE: Queries Lucene 1.3


I used the following to index and it works fine:

        document.add(Field.Text("author", ifile.getAuthor()));
        document.add(Field.Text("title", ifile.getTitle()));
        document.add(Field.Text("extension", ifile.getExtension()));

-----Original Message-----
From: PROYECTA.Fernandez Garcia, Ivan [mailto:[EMAIL PROTECTED]

Sent: Wednesday, November 17, 2004 10:08 AM
To: Lucene Developers List
Subject: RE: Queries Lucene 1.3

If we don't update the IndexWriter.minMergeDocs attribute, Lucene does not find
anything (we don't know why).
When we change the value of the IndexWriter.minMergeDocs attribute and the file has a
lot of pages, an OutOfMemory exception occurs.


-----Original Message-----
From: Xiaozheng Ma [mailto:[EMAIL PROTECTED]
Sent: Wednesday, November 17, 2004 15:59
To: Lucene Developers List
Subject: RE: Queries Lucene 1.3


I am a bit confused about whether the first problem is solved (i.e. the break point at 10). For the out-of-memory exception (OOME), you need to increase the JVM maximum memory size. If you use Tomcat 5, run tomcat5w.exe to reset this value (or do it by editing the registry, or, if you wish, change JAVA_OPTS in catalina.bat or catalina.sh).
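
To confirm that a new maximum heap setting has actually taken effect, the running JVM can be asked for its limit. This is a minimal sketch using only the standard Runtime API; the class name HeapCheck is an assumption for illustration, and the 512 MB figure in the usage note is an example value, not a recommendation from this thread.

```java
// Hypothetical diagnostic class: reports the maximum heap the current JVM may use,
// so a changed -Xmx setting can be verified from inside the application.
final class HeapCheck {

    private HeapCheck() {}

    /** Maximum heap size in megabytes, as reported by the running JVM. */
    static long maxHeapMegabytes() {
        return Runtime.getRuntime().maxMemory() / (1024L * 1024L);
    }

    public static void main(String[] args) {
        System.out.println("Max heap: " + maxHeapMegabytes() + " MB");
    }
}
```

Running it as, for example, java -Xmx512m HeapCheck should report a value close to 512 MB. The same call can be logged from the web application at startup to check what the container passes to the JVM.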

Hope this works.

Xiaozheng


-----Original Message-----
From: PROYECTA.Fernandez Garcia, Ivan [mailto:[EMAIL PROTECTED]

Sent: Wednesday, November 17, 2004 9:49 AM
To: [EMAIL PROTECTED]
Subject: Queries Lucene 1.3

Good afternoon everybody,

        First of all, thanks for your attention.

        We are using the Lucene 1.3 API to index and search text in PDF files.
        We have two environments to develop with it: Windows, using Apache
Tomcat 5.0, and Sun Solaris, using Oracle Application Server.
        First we extract the text of each page from the PDF file using the
Multivalent API (this process seems to run O.K.).
        Then we search for text in the newly created index. At this point we
have the following problem:
                - If the PDF file has 10 pages, the text is found.
                - If the PDF file has more than 10 pages, the text is not
found.
        We modified the IndexWriter.minMergeDocs attribute, trying two values:
the total number of pages in the document, and 1.
        In both cases:
                - if the document is not big, indexing seems to run O.K. and
text search seems to run O.K.
                - if the document is big (600 pages), indexing fails with an
OutOfMemory exception.

        We are sending you our source code files, which index a PDF file and
search for text, in case you can spot an error.
        We don't know what else to do about this problem.
        Can you help us, please?

Thank you for your help.

<<search_text.txt>> <<index_lucene.txt>>




Iván Fernández García
Proyecta Sistemas de Información







---
Outgoing mail is certified Virus Free.
Checked by AVG anti-virus system (http://www.grisoft.com).
Version: 6.0.773 / Virus Database: 520 - Release Date: 05/10/2004


----------------------------------------------
You've chosen the best price. You've chosen IBERIA.com
----------------------------------------------
http://www.iberia.com



---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]

---
Incoming mail is certified Virus Free.
Checked by AVG anti-virus system (http://www.grisoft.com).
Version: 6.0.773 / Virus Database: 520 - Release Date: 05/10/2004







