Re: Inquiries on potential improvements

Sebastian Nagel Sat, 21 Jan 2023 05:15:55 -0800

Hi Kamil,

> 1. json-indexer: indexes documents in json lines format


Sounds good. There's already an indexer-csv (works only in local mode).
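
For illustration only, here is a minimal sketch of the JSON Lines output format itself: one JSON object per line, one line per document. It is plain Python rather than Nutch plugin code, and the field names (`id`, `title`, `content`) and the helper `write_json_lines` are assumptions made up for this example.

```python
import io
import json


def write_json_lines(docs, out):
    """Write each document as a single JSON object on its own line."""
    for doc in docs:
        # json.dumps escapes embedded newlines, so each doc stays on one line.
        out.write(json.dumps(doc, ensure_ascii=False) + "\n")


# Hypothetical documents; the field names are illustrative only.
docs = [
    {"id": "https://example.org/", "title": "Example", "content": "Hello\nworld"},
    {"id": "https://example.org/a", "title": "Page A", "content": "More text"},
]

buf = io.StringIO()
write_json_lines(docs, buf)
lines = buf.getvalue().splitlines()
```

Each line can be parsed independently with `json.loads`, which is what makes the format convenient for streaming and appending.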

> 2. selenium extracts the html tag vs the body tag

Definitely makes sense.

> I am hesitant about this change because it could have bigger effects.

If in doubt, we could add a new method and make it configurable whether innerHTML or outerHTML is returned.


> 3. Add ability to extract meta tags with "property" attribute

+1

(should be applied also to parse-tika - there's some duplicated code between parse-html and parse-tika)
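
As a rough illustration of the idea (not the parse-html or parse-tika code), the sketch below collects meta tags keyed by either the "name" or the "property" attribute, the latter being what Open Graph style tags such as og:title use. The class name `MetaCollector` and the sample HTML are assumptions made up for this example.

```python
from html.parser import HTMLParser


class MetaCollector(HTMLParser):
    """Collect <meta> content values keyed by 'name' or 'property'."""

    def __init__(self):
        super().__init__()
        self.meta = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        # Fall back to 'property' when no 'name' attribute is present.
        key = attributes.get("name") or attributes.get("property")
        content = attributes.get("content")
        if key and content is not None:
            self.meta[key.lower()] = content


# Hypothetical sample page.
html = (
    "<html><head>"
    '<meta name="description" content="A page">'
    '<meta property="og:title" content="Hello"/>'
    "</head><body></body></html>"
)

collector = MetaCollector()
collector.feed(html)
meta = collector.meta
```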


> 4. Allow selenium to handle gzip content

Definitely. However, I wonder whether this couldn't be delegated to one of the other HTTP protocol plugins. But I agree this might be tricky:

- calling a plugin from another one
- maybe better in the way discussed in the user list:
  send a HEAD request and decide which way to go based on the Content-Type
  response header
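
The two ideas above can be sketched roughly as follows, in plain Python rather than Nutch code. The first function decides from a Content-Type value (as a HEAD response would return it) whether a browser-based fetch is needed. The second un-gzips a body when it is flagged or looks gzip-compressed. The function names and the set of "browser" media types are assumptions for this example, and no network request is made.

```python
import gzip
import zlib

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of every gzip stream

# Assumption: only these media types need a real browser to render.
BROWSER_TYPES = {"text/html", "application/xhtml+xml"}


def needs_browser(content_type: str) -> bool:
    """Decide from a Content-Type header value whether to use the browser."""
    media_type = content_type.split(";", 1)[0].strip().lower()
    return media_type in BROWSER_TYPES


def maybe_gunzip(body: bytes, content_encoding: str = "") -> bytes:
    """Return the decompressed body if it is gzip, else the body unchanged."""
    if "gzip" in content_encoding.lower() or body[:2] == GZIP_MAGIC:
        try:
            return gzip.decompress(body)
        except (OSError, EOFError, zlib.error):
            # Header claimed gzip but the payload is not valid gzip.
            return body
    return body


feed = b"<rss version='2.0'></rss>"
compressed = gzip.compress(feed)
```

With this split, a feed served as application/rss+xml would go to a plain HTTP fetch and be decompressed there, while HTML pages would still go through the browser.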

> 5. Treat RSS feeds as normal webpages by adding links to next segment fetch

This is actually done if you let parse-tika parse the feeds. The "feed" plugin is very special in this respect. It takes every feed item as one document and forwards it to the index. This is a different use case. I'm open for discussions, however.


Thanks for your contributions!

Best,
Sebastian

On 1/20/23 16:28, Kamil Mroczek wrote:

Hello,
I have a few improvements to Nutch that I would like to get feedback on whether this community thinks I should submit them to the main branch. Once I get my first PR approved I can start to add these. Some of these might not be good ideas as well so happy to hear that feedback.
1. json-indexer: indexes documents in json lines format
2. selenium extracts the html tag vs the body tag (sample commit <https://github.com/Elio-Earth/nutch/commit/0e2aece37e3a8908221a50ee5803b8d3edd5a33e>): I needed this to extract the title of the page since that often lives in the head tag. I am hesitant about this change because it could have bigger effects.
3. Add ability to extract meta tags with "property" attribute (sample commit <https://github.com/Elio-Earth/nutch/commit/fa23f28f66a00f9048e72a0481c29ba01fc9af56>).
4. Allow selenium to handle gzip content (sample commit <https://github.com/Elio-Earth/nutch/commit/611387097a2f38e345e397bf32d515cea22729fc>): This is a port of the code from HTMLUnit that does the same thing. I needed this to process RSS feeds properly.
5. Treat RSS feeds as normal webpages by adding links to next segment fetch (sample commit <https://github.com/Elio-Earth/nutch/commit/1f8e4a9ee11de3c1a95926dbe279c0258d4c4438>)
Kamil
