(moved to nutch-user)

Tomi NA wrote:
On 9/7/06, Andrzej Bialecki <[EMAIL PROTECTED]> wrote:
Tomi NA wrote:
> On 9/7/06, Nick Burch <[EMAIL PROTECTED]> wrote:
>> On Thu, 7 Sep 2006, Tomi NA wrote:
>> > On 9/7/06, Venkateshprasanna <[EMAIL PROTECTED]> wrote:
>> >> Is there any filter available for extracting text from MS PowerPoint
>> >> files and indexing them?
>> >> The Lucene website suggests the POI project, which, it seems, does not
>> >> support PPT files as of now.
>> >
>> > http://jakarta.apache.org/poi/hslf/index.html
>> >
>> > It doesn't say POI doesn't support PPT. It just says support is limited.
>> > Don't know exactly how limited, but certainly not useless for indexing
>> > purposes.
>>
>> Support for editing and adding things to PowerPoint files is limited, as
>> is getting out the finer points of fonts and positioning.
>
> Which brings me to another (off)topic: can lucene/nutch assign
> different weights to tokens in the same document field? An obvious
> example would be: "this text seems to be in large, bold, blinking
> letters: I'll assume it's more important than the surrounding 8px
> text."

No, it can't (at least not yet). As a workaround, you can extract these
portions of text into another field (or multiple fields) and add them
with a higher boost. Then expand your queries so that they also include
this field. This way, if a query matches these special tokens, results
will rank higher because they match on the boosted field.
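
To make the workaround concrete, here is a minimal, library-free sketch of
the idea, not Lucene or Nutch code. The field names "content" and
"emphasized", the boost values, and the toy scoring formula are all
assumptions made for illustration; real Lucene scoring is more involved.
The expanded query string uses the field:term syntax of the Lucene query
parser.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.Map;

final class BoostedFieldSketch {
    // Hypothetical field names and boosts, chosen only for illustration.
    static final Map<String, Double> FIELD_BOOSTS = new LinkedHashMap<>();
    static {
        FIELD_BOOSTS.put("content", 1.0);     // ordinary body text
        FIELD_BOOSTS.put("emphasized", 3.0);  // text extracted from prominent markup
    }

    // Expand a single-term query into one clause per field.
    static String expandQuery(String term) {
        StringBuilder sb = new StringBuilder();
        for (String field : FIELD_BOOSTS.keySet()) {
            if (sb.length() > 0) {
                sb.append(" OR ");
            }
            sb.append(field).append(':').append(term);
        }
        return sb.toString();
    }

    // Toy score: occurrences of the term in each field, weighted by that field's boost.
    static double score(Map<String, String> doc, String term) {
        String needle = term.toLowerCase();
        double total = 0.0;
        for (Map.Entry<String, Double> e : FIELD_BOOSTS.entrySet()) {
            String text = doc.getOrDefault(e.getKey(), "");
            long hits = Arrays.stream(text.toLowerCase().split("\\W+"))
                              .filter(needle::equals)
                              .count();
            total += hits * e.getValue();
        }
        return total;
    }

    public static void main(String[] args) {
        Map<String, String> doc = new LinkedHashMap<>();
        doc.put("content", "the quarterly report");
        doc.put("emphasized", "report");
        System.out.println(expandQuery("report"));
        System.out.println(score(doc, "report"));
    }
}
```

A document whose emphasized field contains the term outscores one that
only has it in the body, which is the ranking effect described above.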

I thought a workaround like that would be needed. Still, it could give
useful results...though as a nutch user, the possibility is mostly
theoretical for me, as probably none of the existing parsers take into
account the formatting information. I could be completely wrong here,
so please, feel free to correct me.

You can write an HtmlParseFilter that extracts these portions of text
and puts them into ParseData.metadata. Then, during indexing, you can
check whether such metadata exists and, if so, add it as separate
fields. You will also need to modify the QueryFilters to expand user
queries so that they include clauses for these additional fields.
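
As a rough sketch of the extraction step only, the following walks a DOM
tree and collects the text found under "emphasis" elements. It uses only
the JDK's org.w3c.dom API and is deliberately not wired into Nutch: the
HtmlParseFilter method signature and the metadata calls differ between
Nutch versions, so they are left out. The class name, the set of tags
treated as emphasis, and the parseXhtml helper are assumptions for
illustration.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Node;

final class EmphasizedTextSketch {
    // Assumption: these tags mark text that should be treated as more important.
    static final Set<String> EMPHASIS_TAGS =
        new HashSet<>(Arrays.asList("b", "strong", "em", "h1", "h2"));

    // Return the space-separated text found inside any emphasis element.
    static String extract(Node root) {
        StringBuilder out = new StringBuilder();
        collect(root, false, out);
        return out.toString().trim();
    }

    private static void collect(Node node, boolean emphasized, StringBuilder out) {
        if (node.getNodeType() == Node.TEXT_NODE) {
            String text = node.getNodeValue().trim();
            if (emphasized && !text.isEmpty()) {
                out.append(text).append(' ');
            }
            return;
        }
        // Once inside an emphasis element, all descendant text counts as emphasized.
        boolean inside = emphasized
            || (node.getNodeType() == Node.ELEMENT_NODE
                && EMPHASIS_TAGS.contains(node.getNodeName().toLowerCase()));
        for (Node child = node.getFirstChild(); child != null; child = child.getNextSibling()) {
            collect(child, inside, out);
        }
    }

    // Helper for the demo: parse a well-formed XHTML string into a DOM tree.
    static Node parseXhtml(String xhtml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new ByteArrayInputStream(xhtml.getBytes(StandardCharsets.UTF_8)));
        } catch (Exception e) {
            throw new IllegalArgumentException("could not parse input", e);
        }
    }

    public static void main(String[] args) {
        Node doc = parseXhtml(
            "<html><body><h1>Budget</h1><p>Plain <b>urgent</b> text</p></body></html>");
        System.out.println(extract(doc));
    }
}
```

In a real filter, the extracted string would be stored in the parse
metadata under a key of your choosing and turned into a separate, boosted
field at indexing time.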

--
Best regards,
Andrzej Bialecki     <><
___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com

