Antonio Fiol Bonnín wrote:
> a) Refactoring SimpleLuceneXMLIndexerImpl so that its private method
> indexDocument is not private, and taking it to an external component.
>
> b) Creating a PDFGenerator (in the cocoon sense of generator,
> of course).
>
> Option (a) seems to be giving us more headaches than pleasure, and
> option (b) seems cleaner to a certain point. Option (b) would allow to
> follow links in the PDF file, if developed to that point.
I like option (b) too. You could start with plain text and later extend it to
extract basic formatting, hyperlinks, bookmarks (the table of contents),
images, etc.
> However, option (b) implies choosing a format for its output (which?),
An interesting question. Perhaps HTML, beginning with an implementation which produces:
<html>
<head/>
<body>
blah blah blah<br/>
blah blah<br/>
<br class="page"/>
...
</body>
</html>
Later you (or someone else) could add extra things as they need them.
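As a rough sketch of that starting point (not a real Cocoon component), something like the following could emit the plain-text HTML above as SAX events, which is what a generator ultimately has to produce. It assumes the PDF text has already been extracted into a list of lines per page; the class and method names (PlainTextPages, emit, toXml) are made up for illustration, and only standard JDK classes are used.

```java
import java.io.StringWriter;
import java.util.List;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.sax.SAXTransformerFactory;
import javax.xml.transform.sax.TransformerHandler;
import javax.xml.transform.stream.StreamResult;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.AttributesImpl;

// Hypothetical helper: turns already-extracted PDF text into the simple HTML shape.
class PlainTextPages {

    // Sends html/head/body events to any SAX ContentHandler.
    // Assumption: pages is a list of pages, each page a list of text lines.
    static void emit(List<List<String>> pages, ContentHandler out) throws SAXException {
        AttributesImpl none = new AttributesImpl();
        out.startDocument();
        out.startElement("", "html", "html", none);
        out.startElement("", "head", "head", none);
        out.endElement("", "head", "head");
        out.startElement("", "body", "body", none);
        for (int p = 0; p < pages.size(); p++) {
            if (p > 0) {
                // Mark the boundary between two pages with <br class="page"/>.
                AttributesImpl pageBreak = new AttributesImpl();
                pageBreak.addAttribute("", "class", "class", "CDATA", "page");
                out.startElement("", "br", "br", pageBreak);
                out.endElement("", "br", "br");
            }
            for (String line : pages.get(p)) {
                char[] chars = line.toCharArray();
                out.characters(chars, 0, chars.length);
                out.startElement("", "br", "br", none);
                out.endElement("", "br", "br");
            }
        }
        out.endElement("", "body", "body");
        out.endElement("", "html", "html");
        out.endDocument();
    }

    // Convenience for trying it out: serializes the events to a string.
    static String toXml(List<List<String>> pages) throws Exception {
        SAXTransformerFactory factory =
                (SAXTransformerFactory) SAXTransformerFactory.newInstance();
        TransformerHandler handler = factory.newTransformerHandler();
        // Force XML output so empty elements are not rewritten as HTML.
        handler.getTransformer().setOutputProperty(OutputKeys.METHOD, "xml");
        handler.getTransformer().setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
        StringWriter buffer = new StringWriter();
        handler.setResult(new StreamResult(buffer));
        emit(pages, handler);
        return buffer.toString();
    }
}
```

In a real generator the ContentHandler would be whatever the pipeline hands you, and the lines would come from a PDF text extraction library rather than a ready-made list.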
Alternatively, you could use a more PDF-oriented DTD.
I have used a simple freeware tool called pdftohtml which produces XML according to
the following DTD:
<!ELEMENT pdf2xml (page+)>
<!ELEMENT page (fontspec*, text*)>
<!ATTLIST page
number CDATA #REQUIRED
position CDATA #REQUIRED
top CDATA #REQUIRED
left CDATA #REQUIRED
height CDATA #REQUIRED
width CDATA #REQUIRED
>
<!ELEMENT fontspec EMPTY>
<!ATTLIST fontspec
id CDATA #REQUIRED
size CDATA #REQUIRED
family CDATA #REQUIRED
color CDATA #REQUIRED
>
<!ELEMENT text (#PCDATA | b | i)*>
<!ATTLIST text
top CDATA #REQUIRED
left CDATA #REQUIRED
width CDATA #REQUIRED
height CDATA #REQUIRED
font CDATA #REQUIRED
>
<!ELEMENT b (#PCDATA)>
<!ELEMENT i (#PCDATA)>
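To give an idea of how little is needed to get indexable text out of that format, here is a minimal sketch that reads a document following the DTD above and collects the text of each page, ignoring positions and fonts and flattening any b or i markup. It assumes the input is valid against the DTD (every text element sits inside a page); the class and method names (Pdf2XmlText, linesPerPage) are made up for illustration.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper: pulls the text content out of pdf2xml-style XML.
class Pdf2XmlText {

    // Returns one list per <page>, holding the content of each <text> element.
    static List<List<String>> linesPerPage(String xml) throws Exception {
        List<List<String>> pages = new ArrayList<>();
        DefaultHandler handler = new DefaultHandler() {
            // Non-null only while we are inside a <text> element.
            private StringBuilder current;

            @Override
            public void startElement(String uri, String local, String qName, Attributes atts) {
                if ("page".equals(qName)) {
                    pages.add(new ArrayList<>());
                } else if ("text".equals(qName)) {
                    current = new StringBuilder();
                }
            }

            @Override
            public void characters(char[] ch, int start, int length) {
                // Text inside <b> and <i> is appended too, so markup is flattened.
                if (current != null) {
                    current.append(ch, start, length);
                }
            }

            @Override
            public void endElement(String uri, String local, String qName) {
                if ("text".equals(qName)) {
                    pages.get(pages.size() - 1).add(current.toString());
                    current = null;
                }
            }
        };
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        return pages;
    }
}
```

The top, left and font attributes are simply dropped here; they would matter if you later wanted to reconstruct paragraphs or headings.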
> and also poses some problems wrt. the sitemap. Until now, we have a
> pipeline using a reader to read pdf files (static, from disk). And we
> would need a generator to be invoked instead for the content and links
> views. How can we do that? Maybe with a selector? But that does not
> seem very clean. Any hints there?
I'm not sure. A selector might work, and I hope someone else can help you with that.
Note, though, that there's another way to build a Lucene index: using the
LuceneIndexTransformer rather than crawling the site and using views. That technique
would certainly work with option (b), a PDFGenerator, but I'm not sure it would
integrate nicely with option (a), since a transformer requires XML input. So if you
can resolve the sitemap issue, option (b) would work with both indexing techniques,
whereas option (a) could only ever work with the crawler, I think.
Cheers
Con