Hi Maurizio,
as far as I know, the PDF extractor as you have it configured now
extracts all content to the Lucene index only, so that the
text can be found and mapped to the PDF document. I don't think Slide
has a repository extractor that can extract the information and store it
as a property.
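
If you want to double-check outside DAVExplorer that no extracted property
is set on the file, a rough sketch in Python follows. It is only an
illustration under assumptions: the request body is the standard WebDAV
"allprop" PROPFIND (sent with a "Depth: 0" header), the sample response is
hand-written rather than captured from a real repository, the href in it is
my guess at the file's path, and property_names is a hypothetical helper,
not part of Slide or Hippo.

```python
import xml.etree.ElementTree as ET

# Standard WebDAV request body asking for all properties of a resource.
PROPFIND_BODY = (
    '<?xml version="1.0" encoding="utf-8"?>'
    '<D:propfind xmlns:D="DAV:"><D:allprop/></D:propfind>'
)


def property_names(multistatus_xml):
    """Return sorted (namespace, name) pairs found in a Multi-Status body."""
    root = ET.fromstring(multistatus_xml)
    names = set()
    for prop in root.iter("{DAV:}prop"):
        for child in prop:
            tag = child.tag
            if tag.startswith("{"):
                # ElementTree writes qualified names as "{namespace}local".
                namespace, local = tag[1:].split("}", 1)
            else:
                namespace, local = "", tag
            names.add((namespace, local))
    return sorted(names)


# Hand-written sample response (an assumption, not real repository output).
SAMPLE = """<D:multistatus xmlns:D="DAV:">
  <D:response>
    <D:href>/files/default.preview/binaries/this_is_my_title.pdf</D:href>
    <D:propstat>
      <D:prop>
        <D:displayname>this_is_my_title.pdf</D:displayname>
        <D:getcontenttype>application/pdf</D:getcontenttype>
        <D:getcontentlength>5078</D:getcontentlength>
      </D:prop>
      <D:status>HTTP/1.1 200 OK</D:status>
    </D:propstat>
  </D:response>
</D:multistatus>"""

names = property_names(SAMPLE)
# Anything outside the DAV: namespace would be a candidate extracted property.
non_dav = [name for name in names if name[0] != "DAV:"]
```

If non_dav stays empty for the real response as well, that would match what
I described above: the extracted text ends up in the index, not in a property.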
Regards,
Jeroen
Maurizio Pillitu wrote:
Hi everyone,
I'm trying to use the PDFExtractor (using Hippo Repository 1.2.15); I've
added to my (default) extractors.xml the following:
....
<extractor classname="org.apache.slide.extractor.PDFExtractor"
uri="/files/default.preview/binaries" content-type="application/pdf"/>
.....
Then I dropped a PDF file generated by Google Docs (attached) into
/files/default.preview/binaries (via WebDAV). The repository logs
some interesting bits (attached), as if the extraction process went fine, but
I can't see the extracted data. I expected a WebDAV property attached
to the file, but nothing shows up. This is the list of properties of
the PDF file (using DAVExplorer):
getlastmodified DAV: Wed, 16 Dec 2009 09:38:35 GMT
displayname DAV: this_is_my_title.pdf
modificationdate DAV: 2009-12-16T09:38:35Z
UID DAV: 96da71317f000001004b0bbb796bcb32
supportedlock DAV:
getcontenttype DAV: application/pdf
getcontentlength DAV: 5078
resourcetype DAV:
getcontentlanguage DAV: en
getetag DAV: ada3fdca64b1fd70a3d7b2ed66b3e68b
lockdiscovery DAV:
source DAV:
creationdate DAV: 2009-12-16T09:38:35Z
I feel like I'm missing something on how the PDFExtractor works; I've looked
for some documentation or specific configurations, but I couldn't find
anything interesting.
Any hints?
TIA
mau
------------------------------------------------------------------------
********************************************
Hippocms-dev: Hippo CMS development public mailinglist
Searchable archives can be found at:
MarkMail: http://hippocms-dev.markmail.org
Nabble: http://www.nabble.com/Hippo-CMS-f26633.html
********************************************