I'm building a new system that will hold several PDF files. The fields I need in my index are:
1. Name
2. No. of pages
3. File date
4. Archive
When I run a search, I will type full names that occur in the indexed file content. For each match I need the system to return all the fields above (file name, file date) and, above all, the page number where the occurrence was found, the line number, and if possible the exact position within the line where the match starts.

I need this because, starting from the occurrence, I have to go back through the file looking for the words that identify topics and subtopics. Traversing the file backwards line by line lets me capture the first subtopic I find and then, continuing backwards, the first topic. The subtopic and the topic are not always on the same page as the occurrence.

Example:

Document: 00001.pdf

*page 115*
Line 1:
Line 2: *TTTTT* - topic (captured as the first topic found going backwards)
Line 3:
Line 4: YYYY - second subtopic (ignored, because the first subtopic was already captured at line 6)
Line 5:
Line 6: *XXXX* - first subtopic (captured as the first subtopic found going backwards)
Line 7:
...
page 116
...
page 121

*page 122*
Line 1: line break
Line 2: content belonging to the occurrence ...
Line 3: content belonging to the occurrence ...
Line 4: occurrence found, for example: *JOHN MCLAEN*
Line 5: content belonging to the occurrence ...
Line 6: line break
Line 7:

The big problem is that I do not know how to obtain the page number and the line number. Does Lucene give me this when I convert the PDF file to a String for indexing, or will I have to index the file line by line, somehow recording the page each line belongs to?

In the example above, I need the system to return: 1 occurrence on page 122, with topic = TTTTT and subtopic = XXXX, together with all the content that comes before the name *JOHN MCLAEN*, back to the line break.
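To make the backward traversal concrete, here is a rough sketch of the logic I have in mind. It is plain Python, not Lucene code, and it assumes the PDF text has already been extracted into a list of pages, each page being a list of lines. The names `Hit` and `find_hits`, and the `is_topic` / `is_subtopic` predicates, are hypothetical placeholders of mine; how a topic or subtopic line is actually recognised depends on the documents.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Hit:
    page: int                 # 1-based page number of the occurrence
    line: int                 # 1-based line number within that page
    column: int               # 0-based offset of the name within the line
    topic: Optional[str]      # first topic found going backwards, if any
    subtopic: Optional[str]   # first subtopic found going backwards, if any
    block: str                # lines around the occurrence, between blank lines


def find_hits(pages: List[List[str]], name: str,
              is_topic: Callable[[str], bool],
              is_subtopic: Callable[[str], bool]) -> List[Hit]:
    # Flatten to (page, line, text) so that walking backwards crosses
    # page boundaries without any special handling.
    flat = [(p, n, text)
            for p, page in enumerate(pages, start=1)
            for n, text in enumerate(page, start=1)]
    hits = []
    for i, (page_no, line_no, text) in enumerate(flat):
        column = text.find(name)  # only the first match in a line is reported
        if column < 0:
            continue
        # Grow the block in both directions until a blank line is reached.
        start = i
        while start > 0 and flat[start - 1][2].strip():
            start -= 1
        end = i
        while end + 1 < len(flat) and flat[end + 1][2].strip():
            end += 1
        block = "\n".join(t for _, _, t in flat[start:end + 1])
        # Walk backwards: keep the first subtopic seen, stop at the first topic.
        topic = subtopic = None
        for _, _, previous in reversed(flat[:i]):
            if is_topic(previous):
                topic = previous.strip()
                break
            if subtopic is None and is_subtopic(previous):
                subtopic = previous.strip()
        hits.append(Hit(page_no, line_no, column, topic, subtopic, block))
    return hits


# Toy data shaped like the example above (3 short pages instead of 115..122).
pages = [
    ["", "TTTTT", "", "YYYY", "", "XXXX", ""],
    ["unrelated text"],
    ["", "content A", "content B",
     "ADV: JOHN MCLAEN (OAB 103587/SP)", "content C", "", ""],
]
hits = find_hits(pages, "JOHN MCLAEN",
                 is_topic=lambda s: s.strip() == "TTTTT",
                 is_subtopic=lambda s: s.strip() in {"XXXX", "YYYY"})
```

In an index I imagine the same page and line numbers would have to be stored alongside the text, for instance one entry per page or per line, which is exactly the part I am unsure about.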
In other words, the result should be a string containing the block of the occurrence, starting at line 2 of page 122 (after the line break) and ending at line 5 (before the next line break).

*Example of result:*

--------------------------------------------------------------------------------------------------------------------------------------
*Page: 122 - File: 00001.pdf*
*TOPIC: TTTTT*
*SUBTOPIC: XXXX*

Processo 0001933-62.2000.8.26.0081 (001.01.2000.001933) - Procedimento Ordinário - Contratos Bancários - Auto Posto Murillo Ltda - - Murillo Jaccoud - - Murillo Jaccoud Junior - Banco Santander (brasil) Sa - Fica o executado Banco SantanderS/A devidamente intimado através de seu advogado a efetuar o pagamento do valor de R$ 90.200,42 (noventa mil, duzentos reais e quarenta e dois centavos) no prazo de 15 dias, sob pena de multa de 10%, nos termos do artigo 475-J. - ADV: *JOHN MCLAEN* (OAB 103587/SP), MARISA REGINA AMARO MIYASHIRO (OAB 121739/SP), RODRIGO JARA (OAB 275050/SP)
--------------------------------------------------------------------------------------------------------------------------------------

Is this possible? Any help or hint will be of great value.

Thank you very much.