Re: Filtering out unwanted content from HTML pages

Erlend Garåsen Sun, 03 Apr 2011 13:12:00 -0700

I agree with you. I also discussed this with a colleague, and we decided to try to rewrite or extend some of the Tika classes in order to get this functionality. I'll notify the list if I manage to fix this, but it might take some time since we're not working with content enrichment yet.


Erlend

On 31.03.11 16.56, Karl Wright wrote:

This is a good question.  I think we should carry this conversation
forward on connectors-dev.

My initial thought on this issue is that the functionality really
belongs in Tika.  Tika is set up to extract and filter in exactly this
way.  The only reason you'd want to do it in MCF is if it would change
the links you might extract (or, skip), and that seems to me less
interesting.  How do you feel about it?

Karl

On Thu, Mar 31, 2011 at 10:41 AM, Erlend Garåsen
<[email protected]>  wrote:


All major commercial search engines are shipped with a web crawler which
allows one to filter out unwanted content, such as certain HTML blocks,
comments, etc. Would it be advisable to add such functionality to MCF? Or
will it be difficult to implement since the idea behind the
ExtractingRequestHandler is to send binary files to Solr?

Say that you have an HTML document which includes the following comments:
<!-- stop indexing -->
<!-- start indexing -->
All content within these comments should then be skipped from the index.
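
To make the idea concrete, here is a minimal sketch of that kind of filter in Python. It is not the Nutch patch mentioned in this thread, nor anything that exists in MCF or Tika; the function name strip_unindexed and the regex approach are assumptions made purely for illustration. It treats the two comments above as literal markers and drops everything from a stop marker through the next start marker, or to the end of the page if the block is never reopened.

```python
import re

# Marker comments copied from the example above; the constant names are hypothetical.
STOP = "<!-- stop indexing -->"
START = "<!-- start indexing -->"

# Match from a stop marker through the next start marker, or through the
# end of the document if no start marker follows. DOTALL lets "." span newlines.
_SKIP_BLOCK = re.compile(
    re.escape(STOP) + r".*?(?:" + re.escape(START) + r"|\Z)",
    re.DOTALL,
)


def strip_unindexed(html: str) -> str:
    """Return html with every stop/start-delimited block removed."""
    return _SKIP_BLOCK.sub("", html)


page = (
    "<p>Keep this.</p>"
    "<!-- stop indexing -->"
    "<div>Navigation menu</div>"
    "<!-- start indexing -->"
    "<p>Keep this too.</p>"
)
print(strip_unindexed(page))
```

A plain string filter like this would have to run before the page is handed to the extractor. A real implementation would more likely hook into the HTML parser so that the markers are recognised as comment nodes rather than raw text.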

I managed to rewrite Apache Nutch to add this functionality some months
ago.

Erlend

--
Erlend Garåsen
Center for Information Technology Services
University of Oslo
P.O. Box 1086 Blindern, N-0317 OSLO, Norway
Ph: (+47) 22840193, Fax: (+47) 22852970, Mobile: (+47) 91380968, VIP: 31050



