Antonio Fiol Bonnín wrote:

> a) Refactoring SimpleLuceneXMLIndexerImpl so that its private method
> indexDocument is not private, and taking it to an external component.
> 
> b) Creating a PDFGenerator (in the cocoon sense of generator, 
> of course).
> 
> Option (a) seems to be giving us more headaches than pleasure, and
> option (b) seems cleaner to a certain point. Option (b) would allow to
> follow links in the PDF file, if developed to that point.

I like option (b) too. You could start by extracting plain text only; later it could 
be extended to extract basic formatting, hyperlinks, bookmarks (the table of contents), 
images, etc.

> However, option (b) implies choosing a format for its output (which?),

An interesting question. Perhaps HTML, beginning with an implementation which produces:

<html>
   <head/>
   <body>
      blah blah blah<br/>
      blah blah<br/>
      <br class="page"/>
      ... 
   </body>
</html>
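
As a rough illustration of that starting point, the sketch below builds the same 
skeleton from text that has already been extracted page by page. It is written in 
Python only to keep it short and is not Cocoon code; the function name pages_to_html 
and the list-of-lists input are assumptions made for the example. A real PDFGenerator 
would emit the equivalent SAX events rather than build a string.

```python
from xml.sax.saxutils import escape


def pages_to_html(pages):
    """Render extracted text as the minimal HTML skeleton shown above.

    `pages` is a hypothetical input: a list of pages, each a list of text lines.
    """
    out = ["<html>", "   <head/>", "   <body>"]
    for index, lines in enumerate(pages):
        if index > 0:
            # Mark the boundary between two consecutive pages.
            out.append('      <br class="page"/>')
        for line in lines:
            # Escape &, < and > so the result stays well-formed XML.
            out.append("      " + escape(line) + "<br/>")
    out += ["   </body>", "</html>"]
    return "\n".join(out)
```

Calling pages_to_html([["blah blah blah", "blah blah"], ["..."]]) gives a document of 
the shape above, with one page-break marker between the two pages.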

Later you (or someone else) could add extra things as they need them. 

Alternatively, you could use a more PDF-oriented DTD.

I have used a simple freeware tool called pdftohtml which produces XML according to 
the following DTD:

<!ELEMENT pdf2xml (page+)>
<!ELEMENT page (fontspec*, text*)>
<!ATTLIST page
        number CDATA #REQUIRED
        position CDATA #REQUIRED
        top CDATA #REQUIRED
        left CDATA #REQUIRED
        height CDATA #REQUIRED
        width CDATA #REQUIRED
>
<!ELEMENT fontspec EMPTY>
<!ATTLIST fontspec
        id CDATA #REQUIRED
        size CDATA #REQUIRED
        family CDATA #REQUIRED
        color CDATA #REQUIRED
>
<!ELEMENT text (#PCDATA | b | i)*>
<!ATTLIST text
        top CDATA #REQUIRED
        left CDATA #REQUIRED
        width CDATA #REQUIRED
        height CDATA #REQUIRED
        font CDATA #REQUIRED
>
<!ELEMENT b (#PCDATA)>
<!ELEMENT i (#PCDATA)>
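
To give an idea of what consuming this format involves, here is a minimal sketch that 
pulls the plain text of each page out of a document following the DTD above, for 
example to feed an indexer. It is Python rather than Cocoon code, and both the sample 
document and the function name page_texts are made up for illustration; the sample is 
not actual pdftohtml output.

```python
import xml.etree.ElementTree as ET

# Hand-written sample following the DTD above; not real pdftohtml output.
SAMPLE = """<pdf2xml>
<page number="1" position="absolute" top="0" left="0" height="1188" width="918">
<fontspec id="0" size="12" family="Times" color="#000000"/>
<text top="100" left="90" width="200" height="15" font="0"><b>Title</b></text>
<text top="130" left="90" width="300" height="15" font="0">Some <i>body</i> text</text>
</page>
</pdf2xml>"""


def page_texts(xml_source):
    """Return a dict mapping page number to that page's text, one line per <text>."""
    root = ET.fromstring(xml_source)
    result = {}
    for page in root.findall("page"):
        # itertext() also collects the content of nested <b> and <i> elements.
        lines = ["".join(text.itertext()) for text in page.findall("text")]
        result[page.get("number")] = "\n".join(lines)
    return result
```

The positional attributes (top, left, width, height) are ignored here; a fuller version 
could use them to reconstruct paragraphs or columns.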

> and also poses some problems wrt. the sitemap. Until now, we have a
> pipeline using a reader to read pdf files (static, from disk). And we
> would need a generator to be invoked instead for the content and links
> views. How can we do that? Maybe with a selector? But that does not
> seem very clean. Any hints there?

I'm not sure. A selector might work, and I hope someone else can help you with that. 
Note, though, that there is another way to build a Lucene index: using the 
LuceneIndexTransformer rather than crawling the site and using views. That technique 
would certainly work with option (b), a PDFGenerator, but I doubt it would integrate 
nicely with option (a), since a transformer requires XML input. So if you can resolve 
the sitemap issue, option (b) would work with both indexing techniques, whereas 
option (a) could only ever work with the crawler, I think.

Cheers

Con
