Hi Julien and everybody,
First, thanks for the support. It's really helping us.
We've studied Nutch and TC this week, and have also been analyzing your
suggestions. As we understand it, you suggest that we insert the categories
as metadata along with the URLs in the CrawlDB, and then fetch the pages
with Nutch. Up to this point we're fine, but the part about a custom plugin
that would collect the corpus and then generate the model seems awkward to
us: the plugin interface works on only one NutchDocument at a time. So, to
follow that idea exactly, we would first index with a plugin that collects
only the corpus documents and saves them to a file (for example), and then
call another program that reads this file and generates the model. After
that we would need to change the indexer configuration to a second plugin
that uses the generated model to classify the other crawled pages. With
this in mind, it seemed a bit strange to have two indexer plugins that need
to be called in two separate passes, and we were also worried about
concurrency on the file holding the corpus. Another concern was how to keep
the solution flexible and avoid changing Nutch itself, i.e. creating a
modified version. That said, depending on the effort needed to do what we
want, we could consider changing it.
So we came up with a slightly different process, which can be broken into these steps:
1 - Run the new DMOZ parser, which generates a flat file with all the
URLs and their metadata
2 - Filter the flat file (we won't use all the DMOZ URLs, so here we can
select what we want in different ways)
3 - Inject the data from the flat file into the CrawlDB
4 - Crawl (fetch the DMOZ web pages and more)
5 - Classification:
    5.1 - Run the JMS service
    5.2 - Configure Nutch (XML) to use our indexing plug-in
    5.3 - Call Nutch to index (this just adds the documents to the JMS
queue)
    5.4 - Run an application that takes the documents from the JMS queue,
generates the model, and uses it to classify all the other documents in
Nutch's base
6 - Create a query plugin that filters using the QueryString of the
request and the metadata.
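For steps 1-3, the flat file could follow the format Julien showed (e.g. label=DMOZ_categ): Nutch's injector accepts tab-separated key=value metadata after each URL. A hypothetical sketch of what our parser might emit (the URLs and the field names "label" and "isCorpus" are our own inventions):

```shell
# Hypothetical seed file: URL, DMOZ category, and a flag marking
# corpus membership, as tab-separated key=value metadata.
printf 'http://www.example.com/\tlabel=Arts\tisCorpus=true\n'      > dmoz_seeds.txt
printf 'http://www.example.org/\tlabel=Computers\tisCorpus=true\n' >> dmoz_seeds.txt
# Step 3 would then be something like:
# bin/nutch inject crawl/crawldb dmoz_seeds.txt
```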
But we have some doubts.
First, we decided to use JMS because the indexing plug-in is called once
for every document, and to build the SVM model we have to add all of them
to the TrainingCorpus before proceeding. So we decided that the plugin
could add every document to the JMS queue, and afterwards another process
could take the documents, generate the SVM model, and use it to classify
the other documents in the database. Is there a better way to do that?
Would you or anybody else suggest a different approach?
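To make the idea concrete without a broker, here is a minimal sketch of the two-phase flow we mean: the plugin side only enqueues documents, and a separate consumer later splits them into the DMOZ-labeled training corpus and the documents still to be classified. The class and field names (CorpusCollector, Doc, label) are ours, not Nutch or JMS API:

```java
import java.util.ArrayList;
import java.util.List;

// Sketch only: stands in for the JMS queue plus the consumer process.
public class CorpusCollector {
    public static class Doc {
        public final String url, text, label; // label == null => not in DMOZ corpus
        public Doc(String url, String text, String label) {
            this.url = url; this.text = text; this.label = label;
        }
    }

    private final List<Doc> queue = new ArrayList<>();

    // Called once per NutchDocument by the indexing plugin: enqueue only,
    // no training happens here.
    public void add(Doc d) { queue.add(d); }

    // Called once, after indexing finishes: the labeled documents form
    // the training corpus for the SVM.
    public List<Doc> trainingCorpus() {
        List<Doc> corpus = new ArrayList<>();
        for (Doc d : queue) if (d.label != null) corpus.add(d);
        return corpus;
    }

    // The remaining documents are the ones the trained model must classify.
    public List<Doc> unlabeled() {
        List<Doc> rest = new ArrayList<>();
        for (Doc d : queue) if (d.label == null) rest.add(d);
        return rest;
    }
}
```

The point is only the decoupling: training starts once, after the full corpus has been accumulated, instead of inside the per-document plugin callback.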
About the classification (5), we're thinking of opening Nutch's database
and getting the documents before actually calling LibLinear to classify,
since it needs the corpus. And then we also need to update the crawled
pages in the database, because we are going to classify them. Is there a
way to do that with Hadoop without getting every document independently
(as we see in the indexing plugin)? In other words, can we select many
documents at once? Is it difficult to open the CrawlDB from another
program to do this? If there are examples, we would be very thankful. (I
think it's the CrawlDB, but we don't know for sure where it's stored.)
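On the "where is it stored" part, as far as we can tell the CrawlDB holds one CrawlDatum record per URL (status plus metadata) as Hadoop map files under the crawl directory, while the fetched page text lives in the segments. Nutch ships command-line readers for both; a sketch with assumed paths (crawl/ and the segment name are placeholders):

```shell
# Inspect the CrawlDB (URL -> CrawlDatum records under crawl/crawldb/current/):
bin/nutch readdb crawl/crawldb -stats        # summary of the whole DB
bin/nutch readdb crawl/crawldb -dump db_dump # plain-text dump, one record per URL
# The parsed page text is in the segments, not the CrawlDB:
bin/nutch readseg -dump crawl/segments/20100708123456 seg_dump
```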
And about (6), we can filter by the metadata of each page in the result,
right?
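If the category ends up as an indexed field (as in your suggestion of generating a new field to search/filter on), we imagine the query-time filter could be as simple as a field query. A sketch assuming a Solr index and a field named "label" (both assumptions on our side):

```shell
# Assuming our indexing plugin added a "label" field and the docs are in Solr:
curl 'http://localhost:8983/solr/select?q=text:linux&fq=label:Computers'
# When the user chooses "nothing" (no category), we would simply omit the
# fq filter parameter.
```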
Finally, if you guys have any other suggestions or tips, please send them to us.
Thanks.
Best regards,
Luan Cestari
Daniel Gimenes
On Thu, Jul 8, 2010 at 10:37 AM, Julien Nioche <
[email protected]> wrote:
> Hi,
>
>
>> - Using a modified version of DmozParser to initialize crawlDB with
>> the URLs AND the top classification.
>> - Change the crawler to fetch the pages with and include two fields on
>> the webDB:
>> - One field for the classification (get from DMOZ file or
>> classified after by other process)
>> - Another field to indicate if the page is on the DMOZ list (so that
>> it's going to be the corpus for the SVM training).
>> - After all pages are crawled, a new routine will generate the SVM
>> model by consulting webDB for the corpus, its classification and the
>> number
>> of in links and out links.
>> - After that, we'll use the model to classify the other pages at
>> WebDB.
>> - At last, we'll change the search module to get a parameter from the
>> QueryString that indicates the category (or "nothing" if the user chooses
>> not to filter by category) and do the query using it.
>>
> It would probably simplify things a lot if you do the training stage
> independently from Nutch. You will certainly try different sets of weighting
> and parameters to assess the impact on the classification accuracy. Here's
> how I'd do it :
>
> * get a list of URLs and their DMOZ category
> * inject that in the crawlDB and specify the category as metadata e.g.
> http://www.myurl.com/ label=DMOZ_categ
> * fetch and parse in the usual way
>
> the next step would be to extract the documents and generate a training
> corpus. IMHO an interesting aspect is that you have a semi-structured
> information e.g. titles, anchors, meta descriptions, keywords and of course
> the text itself. This should definitely give you more options as to what can
> be used for generating the features. The TC API can take a semi structured
> representation as input.
>
> What you could do would be to write a custom indexer in order to benefit
> from the fields from the NutchDocument at indexing time and generate a
> training corpus for the ML. You then evaluate and train the model and when
> that's done you can use it to classify new documents in Nutch.
>
> This can be done as an indexing plugin. Basically the parser uses the TC
> API to generate a new field which you can then use in SOLR (or in the
> Nutch-Lucene index) to search / filter etc...
>
> If you haven't done so it could be worth having a look at Apache Mahout as
> well
>
> HTH
>
> Julien
> --
> DigitalPebble Ltd
>
> Open Source Solutions for Text Engineering
> http://www.digitalpebble.com
>