[
https://issues.apache.org/jira/browse/NUTCH-882?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12914050#action_12914050
]
Doğacan Güney commented on NUTCH-882:
-------------------------------------
I have implemented a NutchContext object (which only has a Host in it for
now). I also added a fast Host reader that uses bloom filters, as Andrzej
suggested. For now, I extended InjectorJob to carry a NutchContext object and
extended scoring filters to accept NutchContext as an argument (only
scoring filters so far, but I will extend this to all plugins). The fast host
reader uses a new table (called metatable... yeah, not very creative :) to read
and write bloom filter data. The idea is that metatable stores
information about the other tables.

Unfortunately, there is a huge problem that I need help with. I will try to
explain it with an example. Let's say a ParserJob has 6 maps, and we have
extended parse plugins so they can also use NutchContext objects. The problem
is that each map will update its OWN bloom filter and try to write its OWN
bloom filter back to metatable. This, of course, breaks the HostDb
implementation, as one map task overwrites the bloom filter data written by
another. As a fix, I thought each task could write its own bloom filter to a
temporary location keyed by its task id. Once a job finishes, we can then read
all the per-task filters and write out a single bloom filter that combines the
data from all tasks. This is a very HACKISH solution though.

What do you guys think? Any better solutions?
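
To make the per-task idea above concrete, here is a minimal sketch of the merge step. It assumes every task builds its filter with the same bit-array size and the same hash functions, in which case the union of the filters is just the bitwise OR of their bits. SimpleBloomFilter, its hash scheme and mergeAll are hypothetical names made up for illustration; they are not part of the patch, Nutch, or GORA, and the reading and writing of per-task filters from metatable is left out.

```java
import java.nio.charset.StandardCharsets;
import java.util.BitSet;
import java.util.List;

/** Toy bloom filter for illustration only; not the Nutch, GORA or Hadoop class. */
final class SimpleBloomFilter {
    private final int numBits;
    private final int numHashes;
    private final BitSet bits;

    SimpleBloomFilter(int numBits, int numHashes) {
        if (numBits <= 0 || numHashes <= 0) {
            throw new IllegalArgumentException("numBits and numHashes must be positive");
        }
        this.numBits = numBits;
        this.numHashes = numHashes;
        this.bits = new BitSet(numBits);
    }

    // Derive the i-th bit position from two simple string hashes (double hashing).
    private int index(String key, int i) {
        int h1 = 0;
        int h2 = 17;
        for (byte b : key.getBytes(StandardCharsets.UTF_8)) {
            h1 = 31 * h1 + (b & 0xff);
            h2 = 131 * h2 + (b & 0xff);
        }
        return Math.floorMod(h1 + i * h2, numBits);
    }

    void add(String host) {
        for (int i = 0; i < numHashes; i++) {
            bits.set(index(host, i));
        }
    }

    // May return true for a host that was never added, never false for one that was.
    boolean mightContain(String host) {
        for (int i = 0; i < numHashes; i++) {
            if (!bits.get(index(host, i))) {
                return false;
            }
        }
        return true;
    }

    // Union with another filter; only valid when both use identical parameters.
    void merge(SimpleBloomFilter other) {
        if (other.numBits != numBits || other.numHashes != numHashes) {
            throw new IllegalArgumentException("filters must share size and hash count");
        }
        bits.or(other.bits);
    }

    // Combine the filters written by each task into the single filter for the job.
    static SimpleBloomFilter mergeAll(List<SimpleBloomFilter> perTask,
                                      int numBits, int numHashes) {
        SimpleBloomFilter merged = new SimpleBloomFilter(numBits, numHashes);
        for (SimpleBloomFilter f : perTask) {
            merged.merge(f);
        }
        return merged;
    }
}
```

Since OR is commutative and associative, the order in which the per-task filters are read does not matter, and the merge could run in a job cleanup step or in a single reducer. The merged filter may report more false positives than any single per-task filter, because more bits are set.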
> Design a Host table in GORA
> ---------------------------
>
> Key: NUTCH-882
> URL: https://issues.apache.org/jira/browse/NUTCH-882
> Project: Nutch
> Issue Type: New Feature
> Affects Versions: 2.0
> Reporter: Julien Nioche
> Assignee: Julien Nioche
> Fix For: 2.0
>
> Attachments: NUTCH-882-v1.patch
>
>
> Having a separate GORA table for storing information about hosts (and
> domains?) would be very useful for:
> * customising the behaviour of the fetching on a host basis, e.g. number of
> threads, min time between threads etc.
> * storing stats
> * keeping metadata and possibly propagating them to the webpages
> * keeping a copy of the robots.txt and possibly using that later to filter the
> webtable
> * storing sitemap files and updating the webtable accordingly
> I'll try to come up with a GORA schema for such a host table, but any comments
> are of course already welcome.