Re: [HippoCMS-dev] Help with PDFExtractor

Jasha Joachimsthal Sun, 27 Dec 2009 23:59:53 -0800

2009/12/16 Maurizio Pillitu <[email protected]>:
> Got it to work!
>
> Some restrictions in the DASL query were excluding the PDF from the
> results.
>
> Thanks a lot!


You're welcome!
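
For reference, a minimal sketch of a DASL basicsearch that uses the full-text condition quoted further down in this thread (<d:contains>) instead of a property match. The element names follow the WebDAV SEARCH basicsearch grammar. The scope path is taken from the original post, and the selected properties are two of those listed there. Whether the repository expects this exact scope href (for example with a context prefix) is an assumption, so treat it as a starting point rather than a verified query.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<d:searchrequest xmlns:d="DAV:">
  <d:basicsearch>
    <!-- Properties to return for each hit (taken from the property list in the original post) -->
    <d:select>
      <d:prop>
        <d:displayname/>
        <d:getcontenttype/>
      </d:prop>
    </d:select>
    <!-- Assumption: the scope is the folder the PDF was dropped into -->
    <d:from>
      <d:scope>
        <d:href>/files/default.preview/binaries</d:href>
        <d:depth>infinity</d:depth>
      </d:scope>
    </d:from>
    <!-- Full-text condition only; no property comparison that could exclude the PDF -->
    <d:where>
      <d:contains>mySearchWord</d:contains>
    </d:where>
  </d:basicsearch>
</d:searchrequest>
```

Any extra property conditions in the where clause narrow the result set further, which is how a PDF that carries no such property can drop out of the results.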

>
> mau
>
> On Wed, Dec 16, 2009 at 11:26 AM, Jasha Joachimsthal <[email protected]> wrote:
>
>> 2009/12/16 Maurizio Pillitu <[email protected]>:
>> > Thanks guys,
>> > I know I was missing some bits of the big picture :)
>> >
>> > So here's the next question: when I perform a DASL query, I normally
>> > *select* some properties *from* some repository location (path) *where* a
>> > certain property matches one or more conditions; if I don't have a property
>> > to match, how can I define the *where* condition?
>> >
>> > sounds like a very stupid question .... sorry for that.
>>
>> There are no stupid questions!
>> For fulltext search, you can do <d:contains>mySearchWord</d:contains>
>> If you really need properties, you can let the user set them in the
>> assets perspective. See [1]
>>
>> [1]
>> http://wiki.onehippo.com/display/CMS/WebDAV+properties+used+by+Hippo+CMS
>>
>> > Thx again
>> >
>> > mau
>> >
>> > On Wed, Dec 16, 2009 at 11:01 AM, Jeroen Reijn <[email protected]> wrote:
>> >
>> >> Hi Maurizio,
>> >>
>> >> as far as I know, the PDF extractor as you have configured it now extracts
>> >> all content to the Lucene index only and makes sure that the text can be
>> >> found and mapped to the PDF document. I don't think Slide has a repository
>> >> extractor that can extract the information and store it as a property.
>> >>
>> >> Regards,
>> >>
>> >> Jeroen
>> >>
>> >> Maurizio Pillitu wrote:
>> >>
>> >>> Hi everyone,
>> >>> I'm trying to use the PDFExtractor (using Hippo Repository 1.2.15); I've
>> >>> added to my (default) extractors.xml the following:
>> >>>
>> >>> ....
>> >>> <extractor classname="org.apache.slide.extractor.PDFExtractor"
>> >>> uri="/files/default.preview/binaries" content-type="application/pdf"/>
>> >>> .....
>> >>>
>> >>> then I dropped a Google Docs generated PDF file (attached) in
>> >>> /files/default.preview/binaries (via WebDAV); I see the repository logging
>> >>> some interesting bits (attached) as if the extraction process went fine, but
>> >>> I can't see the extracted data; I'd have expected a WebDAV property attached
>> >>> to the file, but nothing shows up; this is the list of properties related
>> >>> to the PDF file (using DAVExplorer):
>> >>>
>> >>> getlastmodified DAV: Wed, 16 Dec 2009 09:38:35 GMT
>> >>> displayname DAV: this_is_my_title.pdf
>> >>> modificationdate DAV: 2009-12-16T09:38:35Z
>> >>> UID DAV: 96da71317f000001004b0bbb796bcb32
>> >>> supportedlock DAV:
>> >>> getcontenttype DAV: application/pdf
>> >>> getcontentlength DAV: 5078
>> >>> resourcetype DAV:
>> >>> getcontentlanguage DAV: en
>> >>> getetag DAV: ada3fdca64b1fd70a3d7b2ed66b3e68b
>> >>> lockdiscovery DAV:
>> >>> source DAV:
>> >>> creationdate DAV: 2009-12-16T09:38:35Z
>> >>>
>> >>>
>> >>> I feel like I'm missing something on how the PDFExtractor works; I've looked
>> >>> for some documentation or specific configurations, but I couldn't find
>> >>> anything interesting.
>> >>>
>> >>> Any hints?
>> >>> TIA
>> >>>  mau
>> >>>
>> >>> Kind regards,
>> >>>
>> >>>
>> >>>
>> >>> ------------------------------------------------------------------------
>> >>>
>> >>>
>> >>> ********************************************
>> >>> Hippocms-dev: Hippo CMS development public mailinglist
>> >>>
>> >>> Searchable archives can be found at:
>> >>> MarkMail: http://hippocms-dev.markmail.org
>> >>> Nabble: http://www.nabble.com/Hippo-CMS-f26633.html
>> >>>
>> >>>  ********************************************
>> >
>> >
>> > --
>> >
>> > Kind regards,
>> > --
>> > Maurizio Pillitu - 0031 (0)615655668
>> > Opensource Software Engineer
>> > Scrum Certified Master - http://www.scrumalliance.org
>> > Sourcesense - making sense of Open Source: http://www.sourcesense.com
>
>
