Hello,

I tried using pdftohtml (from sourceforge) to convert one of my PDFs to HTML
to see if I could use this approach to populate my DSpace with HTML
equivalents of my PDFs. I was not worried if the conversion ignored text
extraction, since I have found that pdfbox can do that for me. I just want a
good rendered result. The users of the digital library are to be offered
HTML links as an alternative to PDFs, presumably for those who don't like
PDFs (for whatever reason).
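
A minimal sketch of the kind of batch run described above, assuming the
pdftohtml executable is on the PATH and that all the PDFs sit in a single
directory. The function names (build_command, convert_all) and the directory
layout are hypothetical, chosen only for illustration; the -c and -noframes
options are standard pdftohtml flags, but whether they help with two-column
journal pages is untested here.

```python
from pathlib import Path
import shutil
import subprocess


def build_command(pdf_path, out_dir):
    """Return the pdftohtml argument list for one PDF (nothing is executed)."""
    pdf_path = Path(pdf_path)
    # pdftohtml takes an output base name; it adds the .html suffix itself.
    out_base = Path(out_dir) / pdf_path.stem
    # -c: "complex" mode, which tries to keep the page layout
    # -noframes: skip the frameset index document
    return ["pdftohtml", "-c", "-noframes", str(pdf_path), str(out_base)]


def convert_all(pdf_dir, out_dir):
    """Run pdftohtml on every *.pdf in pdf_dir and return the commands run."""
    if shutil.which("pdftohtml") is None:
        raise RuntimeError("pdftohtml not found on PATH")
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    commands = []
    for pdf in sorted(Path(pdf_dir).glob("*.pdf")):
        cmd = build_command(pdf, out_dir)
        subprocess.run(cmd, check=True)  # raises if pdftohtml exits non-zero
        commands.append(cmd)
    return commands
```

Calling convert_all("pdfs", "html") would write one HTML rendering per PDF
into the html directory; getting those files into DSpace as additional
bitstreams would be a separate step not shown here.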

I was impressed by the demo on the pdftohtml page but when I tried it out on
an academic journal the result was terrible. Perhaps there is a problem with
the two-column layout that is common in journals. The demo was in the style
of a newspaper article, i.e. one column with interspersed pictures. I am
pleased to say that in spite of the pdftohtml problems it did at least
render the figures in the journal article properly. But the two-column
aspect was ignored and it looks like the maths was not handled either. The
journals I will be dealing with are nearly all scientific and will have lots
of maths in them.

So I was wondering, what tools have DSpace people used to convert from PDF
to HTML? Maybe people know about pdftohtml and can make some suggestions.
Maybe there is a better tool that people here know about. I think this is a
good place to ask because it is very common for a digital library to offer
both PDF and non-PDF forms of the same document. Surely I can't be the only
one who wants to set up such a library starting only from the PDFs.
-- 
Regards,

Andrew M.
_______________________________________________
DSpace-tech mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/dspace-tech
