Antonio Fiol Bonnín wrote:

> a) Refactoring SimpleLuceneXMLIndexerImpl so that its private method
> indexDocument is not private, and taking it to an external component.
>
> b) Creating a PDFGenerator (in the cocoon sense of generator,
> of course).
>
> Option (a) seems to be giving us more headaches than pleasure, and
> option (b) seems cleaner to a certain point. Option (b) would allow to
> follow links in the PDF file, if developed to that point.

I like option (b) too. You could start with plain text, but it could later be developed to extract basic formatting, hyperlinks, bookmarks (the table of contents), images, etc.

> However, option (b) implies choosing a format for its output (which?),

An interesting question. Perhaps html, and begin with an implementation which produces:

    <html>
      <head/>
      <body>
        blah blah blah<br/>
        blah blah<br/>
        <br class="page"/>
        ...
      </body>
    </html>

Later you (or someone else) could add extra things as they need them.

Alternatively, you could use a more PDF-oriented DTD. I have used a simple freeware tool called pdftohtml which produces XML according to the following DTD:

    <!ELEMENT pdf2xml (page+)>
    <!ELEMENT page (fontspec*, text*)>
    <!ATTLIST page
        number CDATA #REQUIRED
        position CDATA #REQUIRED
        top CDATA #REQUIRED
        left CDATA #REQUIRED
        height CDATA #REQUIRED
        width CDATA #REQUIRED
    >
    <!ELEMENT fontspec EMPTY>
    <!ATTLIST fontspec
        id CDATA #REQUIRED
        size CDATA #REQUIRED
        family CDATA #REQUIRED
        color CDATA #REQUIRED
    >
    <!ELEMENT text (#PCDATA | b | i)*>
    <!ATTLIST text
        top CDATA #REQUIRED
        left CDATA #REQUIRED
        width CDATA #REQUIRED
        height CDATA #REQUIRED
        font CDATA #REQUIRED
    >
    <!ELEMENT b (#PCDATA)>
    <!ELEMENT i (#PCDATA)>

> and also poses some problems wrt. the sitemap. Until now, we have a
> pipeline using a reader to read pdf files (static, from disk). And we
> would need a generator to be invoked instead for the content and links
> views. How can we do that? Maybe with a selector? But that does not
> seem very clean. Any hints there?

I'm not sure. It might work. I hope someone else can help you with that.

But NB there's also another way to build a Lucene index - using the LuceneIndexTransformer rather than by crawling the site and using views. This technique would certainly work with option (b) - a PDFGenerator - but I'm not sure that it would integrate nicely with option (a) since it's a transformer and therefore requires XML.
So if you could resolve the sitemap issue with option (b) then it would work with both indexing techniques, whereas option (a) could only ever work with the crawler, I think.

Cheers

Con
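
P.S. To make the simple html output for option (b) a bit more concrete, here is a minimal Java sketch. It is not Cocoon code, and it does not read PDF at all: the class name PdfTextToHtmlSketch, the methods emit and toXml, and the assumption that the text has already been extracted into a list of pages (each a list of lines) are all made up for illustration. It only uses the JDK's SAX and TrAX classes to show the kind of SAX events a generator might send down the pipeline.

```java
import java.io.StringWriter;
import java.util.List;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.sax.SAXTransformerFactory;
import javax.xml.transform.sax.TransformerHandler;
import javax.xml.transform.stream.StreamResult;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.AttributesImpl;

// Hypothetical sketch: all names here are illustrative, not part of Cocoon.
public class PdfTextToHtmlSketch {

    /**
     * Sends html/head/body SAX events to the handler: one <br/> after each
     * text line, and <br class="page"/> between consecutive pages.
     * Assumes the PDF text was already extracted into pages of lines.
     */
    static void emit(List<List<String>> pages, ContentHandler out) throws SAXException {
        AttributesImpl none = new AttributesImpl();
        out.startDocument();
        out.startElement("", "html", "html", none);
        out.startElement("", "head", "head", none);
        out.endElement("", "head", "head");
        out.startElement("", "body", "body", none);
        for (int p = 0; p < pages.size(); p++) {
            if (p > 0) {
                // Page separator, as in the sample output above.
                AttributesImpl pageBreak = new AttributesImpl();
                pageBreak.addAttribute("", "class", "class", "CDATA", "page");
                out.startElement("", "br", "br", pageBreak);
                out.endElement("", "br", "br");
            }
            for (String line : pages.get(p)) {
                char[] chars = line.toCharArray();
                out.characters(chars, 0, chars.length);
                out.startElement("", "br", "br", none);
                out.endElement("", "br", "br");
            }
        }
        out.endElement("", "body", "body");
        out.endElement("", "html", "html");
        out.endDocument();
    }

    /** Serialises the SAX events to a string so the result can be inspected. */
    static String toXml(List<List<String>> pages) throws Exception {
        SAXTransformerFactory factory =
                (SAXTransformerFactory) SAXTransformerFactory.newInstance();
        TransformerHandler handler = factory.newTransformerHandler();
        // Force XML output; a root element named "html" may otherwise
        // be serialised with the HTML output method.
        handler.getTransformer().setOutputProperty(OutputKeys.METHOD, "xml");
        handler.getTransformer().setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
        StringWriter buffer = new StringWriter();
        handler.setResult(new StreamResult(buffer));
        emit(pages, handler);
        return buffer.toString();
    }

    public static void main(String[] args) throws Exception {
        List<List<String>> pages = List.of(
                List.of("blah blah blah", "blah blah"),
                List.of("second page"));
        System.out.println(toXml(pages));
    }
}
```

In a real PDFGenerator the emit step would presumably write to whatever content handler the pipeline supplies, so the same events could feed either the crawler's views or the LuceneIndexTransformer.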