On 9/8/06, Andrzej Bialecki <[EMAIL PROTECTED]> wrote:
> (moved to nutch-user)
>
> Tomi NA wrote:
> > On 9/7/06, Andrzej Bialecki <[EMAIL PROTECTED]> wrote:
> >> Tomi NA wrote:
> >> > On 9/7/06, Nick Burch <[EMAIL PROTECTED]> wrote:
> >> >> On Thu, 7 Sep 2006, Tomi NA wrote:
> >> >> > On 9/7/06, Venkateshprasanna <[EMAIL PROTECTED]> wrote:
> >> >> >> Is there any filter available for extracting text from MS
> >> >> >> Powerpoint files and indexing them?
> >> >> >> The lucene website suggests the POI project, which, it seems
> >> >> >> does not support PPT files as of now.
> >> >> >
> >> >> > http://jakarta.apache.org/poi/hslf/index.html
> >> >> >
> >> >> > It doesn't say poi doesn't support ppt. It just says support is
> >> >> > limited.
> >> >> > Don't know exactly how limited, but certainly not useless for
> >> >> > indexing purposes.
> >> >>
> >> >> Support for editing and adding things to PowerPoint files is
> >> >> limited, as is getting out the finer points of fonts and
> >> >> positioning.
> >> >
> >> > Which brings me to another (off)topic: can lucene/nutch assign
> >> > different weights to tokens in the same document field? An obvious
> >> > example would be: "this text seems to be in large, bold, blinking
> >> > letters: I'll assume it's more important than the surrounding 8px
> >> > text."
> >>
> >> No, it can't (at least not yet). As a workaround you can extract these
> >> portions of text to another field (or multiple fields), and then add
> >> them with a higher boost. Then, expand your queries so that they include
> >> also this field. This way, if query matches these special tokens,
> >> results will get higher rank because of matching on this boosted field.
> >
> > I thought a workaround like that would be needed. Still, it could give
> > useful results...though as a nutch user, the possibility is mostly
> > theoretical for me, as probably none of the existing parsers take into
> > account the formatting information. I could be completely wrong here,
> > so please, feel free to correct me.
>
> You can write a HtmlParseFilter, which will extract these portions of
> text and put them into ParseData.metadata. Then, during indexing you can
> check if such metadata exists and if yes - add it as separate fields.
> You will need also to modify the QueryFilters, to expand user queries to
> also include clauses for these additional fields.

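For illustration, the workaround quoted above (a separate, boosted field
plus expanded queries) can be modelled in a few lines of plain Java. This
is only a toy sketch and uses no Lucene or Nutch API: the class name, the
"emphasized" field name, the boost value of 3.0 and the term-counting
score are all assumptions made up for the example, and real Lucene
scoring is considerably more involved. It also assumes the emphasized
text stays in the main content field as well.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Toy model of the "boosted extra field" workaround. NOT Lucene/Nutch code:
// every name and number here is hypothetical and chosen for illustration.
class BoostedFieldSketch {

    // Per-field weights. The hypothetical "emphasized" field holds text a
    // parser judged prominent (large, bold), so matches there count more.
    static final Map<String, Double> FIELD_BOOSTS = Map.of(
            "content", 1.0,
            "emphasized", 3.0);

    // Query expansion: a single user term is checked against every field.
    // Score = sum over fields of (field boost * occurrences of the term).
    static double score(Map<String, String> doc, String term) {
        double total = 0.0;
        for (Map.Entry<String, Double> e : FIELD_BOOSTS.entrySet()) {
            String text = doc.getOrDefault(e.getKey(), "");
            total += e.getValue() * countOccurrences(text, term);
        }
        return total;
    }

    // Counts whole-word, case-insensitive matches of term in text.
    static int countOccurrences(String text, String term) {
        String wanted = term.toLowerCase(Locale.ROOT);
        int count = 0;
        for (String token : text.toLowerCase(Locale.ROOT).split("\\W+")) {
            if (token.equals(wanted)) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        Map<String, String> plain = new HashMap<>();
        plain.put("content", "quarterly sales report for the region");

        Map<String, String> prominent = new HashMap<>();
        prominent.put("content", "quarterly sales report for the region");
        prominent.put("emphasized", "sales"); // e.g. text from a slide title

        System.out.println(score(plain, "sales"));     // prints 1.0
        System.out.println(score(prominent, "sales")); // prints 4.0
    }
}
```

Both documents contain the same body text, but the one that also carries
the term in the boosted field scores higher, which is the ranking effect
the workaround is after.
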
Thanks Andrzej, I understand the concepts involved now. If the need
arises, I'll see what I can do about making it work as intended.

t.n.a.

_______________________________________________
Nutch-general mailing list
https://lists.sourceforge.net/lists/listinfo/nutch-general
