Julien,

After studying Nutch and TC some more, we decided not to use JMS after all.
We think the best thing we can do is access Nutch's database from an
external application, get the fetched documents from DMOZ, generate the
classifier, and then classify the other crawled pages.

At the moment we are searching for more details about how to do that (access
Nutch's database). If you have any tips, we would be glad to read them =-)
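So far the most promising route we've found is Nutch's own command-line readers, which already dump the CrawlDB and segments to plain text. A quick sketch, assuming a standard Nutch 1.x layout with the crawl in `crawl/` (the segment directory name below is just a placeholder):

```shell
# Summary statistics for the CrawlDB (URL counts by status, score stats)
bin/nutch readdb crawl/crawldb -stats

# Dump every entry (URL, status, fetch time, metadata) as plain text
bin/nutch readdb crawl/crawldb -dump crawldb_dump

# Look up the CrawlDatum of a single URL
bin/nutch readdb crawl/crawldb -url http://www.myurl.com/

# Dump a fetched segment, including the parsed text of each page
bin/nutch readseg -dump crawl/segments/20100711120000 segment_dump
```

The `-dump` output is easy to post-process from an external program, which might be enough for building the training corpus without touching Nutch's internals.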

Regards,
Daniel
Luan

On Sun, Jul 11, 2010 at 11:29 PM, Luan Cestari <[email protected]> wrote:

> Hi Julien and everybody,
>
> First, thanks for the support. It's really helping us.
>
> We've studied Nutch and TC this week, and have been analyzing your
> suggestions too. As we understood it, you suggest that we insert the categories
> as metadata along with the URLs in the CrawlDB and fetch the pages with Nutch. Up
> to this part it's OK, but the part about the custom plugin that would collect
> the corpus and then generate the model seems strange to us. The plugin interface
> works with only one NutchDocument at a time. So, as we understood it, to follow
> this idea exactly we would first index with a plugin that only collects the
> corpus and saves it to a file (for example), then call another program that
> reads this file and generates the model. After that we would need to change the
> indexing plugin to use another plugin that applies the generated model and
> classifies the other crawled pages. With this in mind, we found it a little
> strange to have two indexer plugins that need to be called twice, and regarding
> the file holding the corpus we were worried about concurrency. Another thing we
> had in mind was how to keep this flexible and avoid changing Nutch itself,
> i.e. creating a modified version. But depending on the effort needed to do what
> we want, we may consider changing it.
>
> So we thought of a slightly different process, which can be put into these steps:
>
> 1 - Execute the new DMOZ parser - generates a flat file with all URLs and
> metadata
> 2 - Filter the flat file (we won't use all DMOZ URLs, so here we can select
> what we want in different ways)
> 3 - Inject the data from the flat file into the CrawlDB
> 4 - Crawl (fetch the DMOZ webpages and more)
> 5 - Classification:
>     5.1 - Run the JMS service
>     5.2 - Configure Nutch (XML) to use our indexing plug-in
>     5.3 - Call Nutch to index (it just adds the documents to the JMS queue)
>     5.4 - Run an application that gets the documents from the JMS queue,
> generates the model, and uses it to classify all the other documents in
> Nutch's base
> 6 - Create a query plugin to filter using a request by QueryString and
> metadata.
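To make steps 1-3 above concrete, here is a minimal plain-JDK sketch (no Nutch dependencies) of what writing the flat seed file could look like. The `label=` metadata key follows Julien's earlier suggestion, the tab-separated `url<TAB>key=value` line format is an assumption about what the Injector accepts, and the URLs and categories are placeholders:

```java
import java.io.FileWriter;
import java.io.IOException;
import java.io.PrintWriter;
import java.util.LinkedHashMap;
import java.util.Map;

// Sketch of steps 1-3: turn (url, DMOZ category) pairs into a flat seed
// file, one "url<TAB>key=value" line per URL.
public class SeedFileWriter {

    // Format a single seed line; the "label" metadata key is an assumption.
    static String seedLine(String url, String category) {
        return url + "\tlabel=" + category;
    }

    public static void main(String[] args) throws IOException {
        // Placeholder entries standing in for the parsed DMOZ output.
        Map<String, String> urls = new LinkedHashMap<String, String>();
        urls.put("http://www.myurl.com/", "Arts");
        urls.put("http://www.example.org/", "Science");

        PrintWriter out = new PrintWriter(new FileWriter("seeds.txt"));
        for (Map.Entry<String, String> e : urls.entrySet()) {
            // Step 2 (filtering) would go here, e.g. keep only some categories.
            out.println(seedLine(e.getKey(), e.getValue()));
        }
        out.close();
    }
}
```

The filtering in step 2 then reduces to a predicate inside that loop, which keeps the whole seed-generation stage outside Nutch.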
>
>
> But we have some doubts.
>
> First, we decided to use JMS because the indexing plug-in is called for
> every document, and to build the SVM model we have to add all of them to the
> TrainingCorpus before proceeding. So we decided that the plugin could add
> every document to the JMS queue, and afterwards another process could take
> the documents, generate the SVM model, and use it to classify the other
> documents that are in the base. Is there a better way to do that? Would you
> or somebody else suggest a different approach?
>
> About the classification (5), we're thinking of opening Nutch's database and
> getting the documents before actually calling LibLinear to classify, since it
> needs the corpus. And then we also need to update the crawled pages in the
> database, because we are going to classify them. Is there a way to do that with
> Hadoop without getting every document independently (as we see in the indexing
> plugin)? In other words, can we select many documents at once? Is it
> difficult to open the CrawlDB from another program to do this? If there are
> examples, we would be very thankful. (I think it's the CrawlDB; I don't know
> for sure where it's stored.)
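One possible answer to the question above: in Nutch 1.x the CrawlDB lives on the filesystem as Hadoop MapFiles under `crawldb/current/`, and each part's `data` file is an ordinary SequenceFile of `Text` (URL) to `CrawlDatum` pairs, so an external program can iterate over it directly. An untested sketch, assuming a single part file and the Hadoop 0.20-era API (note the CrawlDB only holds per-URL status and metadata; the fetched page text lives in the segments, e.g. under `parse_text`, which can be read the same way):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;
import org.apache.nutch.crawl.CrawlDatum;

// Untested sketch: iterate over the CrawlDB from an external program.
// Assumes crawl/crawldb/current/part-00000 is a Hadoop MapFile whose
// "data" file is a SequenceFile of Text (url) -> CrawlDatum entries.
public class CrawlDbDump {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path data = new Path("crawl/crawldb/current/part-00000/data");

        SequenceFile.Reader reader = new SequenceFile.Reader(fs, data, conf);
        Text url = new Text();
        CrawlDatum datum = new CrawlDatum();
        while (reader.next(url, datum)) {
            // datum carries status, fetch time, score and metadata
            // (e.g. an injected "label" category) for each URL.
            System.out.println(url + "\t" + datum);
        }
        reader.close();
    }
}
```

For selecting or updating many documents at once, the idiomatic route would be a small MapReduce job over those same files rather than point lookups, which is essentially what Nutch's own `readdb`/`updatedb` tools do.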
>
> And about (6), we can filter by the metadata of each page of the result,
> right?
>
> Finally, any other suggestion from you guys or tips, please send to us.
>
> Thanks.
>
> Best regards,
> Luan Cestari
> Daniel Gimenes
>
>
> On Thu, Jul 8, 2010 at 10:37 AM, Julien Nioche <
> [email protected]> wrote:
>
>> Hi,
>>
>>
>>>    - Using a modified version of DmozParser to initialize the CrawlDB with
>>>    the URLs AND the top-level classification.
>>>    - Change the crawler to fetch the pages and include two fields
>>>    in the WebDB:
>>>       - One field for the classification (taken from the DMOZ file, or
>>>       assigned later by another process)
>>>       - Another field to indicate whether the page is on the DMOZ list (so
>>>       that it becomes the corpus for the SVM training).
>>>    - After all pages are crawled, a new routine will generate the SVM
>>>    model by consulting the WebDB for the corpus, its classification, and the
>>> number
>>>    of inlinks and outlinks.
>>>    - After that, we'll use the model to classify the other pages in the
>>>    WebDB.
>>>    - At last, we'll change the search module to get a parameter from the
>>>    QueryString that indicates the category (or "nothing" if the user chooses
>>>    not to filter by category) and do the query using it.
>>>
>> It would probably simplify things a lot if you do the training stage
>> independently from Nutch. You will certainly try different weighting schemes
>> and parameters to assess the impact on the classification accuracy. Here's
>> how I'd do it:
>>
>> * get a list of URLs and their DMOZ category
>> * inject that in the crawlDB and specify the category as metadata e.g.
>> http://www.myurl.com/ label=DMOZ_categ
>> * fetch and parse in the usual way
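The three steps above can be sketched as a short command sequence, assuming a `seeds` directory containing a file of `url<TAB>key=value` lines (recent Nutch 1.x Injectors pick up tab-separated key=value metadata after the URL) and the standard crawl layout:

```shell
# seeds/urls.txt contains tab-separated lines like:
#   http://www.myurl.com/<TAB>label=DMOZ_categ

# Inject the seed URLs (and their label metadata) into the CrawlDB
bin/nutch inject crawl/crawldb seeds

# Then the usual generate / fetch / parse / update cycle
bin/nutch generate crawl/crawldb crawl/segments
s=`ls -d crawl/segments/* | tail -1`   # newest segment
bin/nutch fetch $s
bin/nutch parse $s
bin/nutch updatedb crawl/crawldb $s
```

After `updatedb`, every fetched page's CrawlDatum carries the injected category metadata, which is what the indexing stage can later read back.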
>>
>> the next step would be to extract the documents and generate a training
>> corpus. IMHO an interesting aspect is that you have semi-structured
>> information, e.g. titles, anchors, meta descriptions, keywords and of course
>> the text itself. This should definitely give you more options as to what can
>> be used for generating the features. The TC API can take a semi-structured
>> representation as input.
>>
>> What you could do would be to write a custom indexer in order to benefit
>> from the fields from the NutchDocument at indexing time and generate a
>> training corpus for the ML. You then evaluate and train the model and when
>> that's done you can use it to classify new documents in Nutch.
>>
>> This can be done as an indexing plugin. Basically the parser uses the TC
>> API to generate a new field which you can then use in SOLR (or in the
>> Nutch-Lucene index) to search / filter etc...
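A rough, untested sketch of what such an indexing plugin could look like against the Nutch 1.x `IndexingFilter` interface (the `classify()` helper is a placeholder for the LibLinear / TC call, not a real API; the `"category"` field name is an arbitrary choice):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.Text;
import org.apache.nutch.crawl.CrawlDatum;
import org.apache.nutch.crawl.Inlinks;
import org.apache.nutch.indexer.IndexingException;
import org.apache.nutch.indexer.IndexingFilter;
import org.apache.nutch.indexer.NutchDocument;
import org.apache.nutch.parse.Parse;

// Sketch: once a trained model exists, classify each document at
// indexing time and attach the result as a searchable "category" field.
public class CategoryIndexingFilter implements IndexingFilter {

    private Configuration conf;

    public NutchDocument filter(NutchDocument doc, Parse parse, Text url,
            CrawlDatum datum, Inlinks inlinks) throws IndexingException {
        // Use title + body text as features; anchors, meta keywords etc.
        // could be concatenated the same way.
        String text = parse.getData().getTitle() + " " + parse.getText();
        doc.add("category", classify(text));
        return doc;
    }

    // Placeholder: would load the trained SVM model and predict a label.
    private String classify(String text) {
        return "unknown";
    }

    public void setConf(Configuration conf) { this.conf = conf; }
    public Configuration getConf() { return conf; }
}
```

The same filter, pointed at a file sink instead of `doc.add()`, could also serve as the corpus-extraction pass during the training phase.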
>>
>> If you haven't done so already, it could be worth having a look at Apache
>> Mahout as well
>>
>> HTH
>>
>> Julien
>> --
>> DigitalPebble Ltd
>>
>> Open Source Solutions for Text Engineering
>> http://www.digitalpebble.com
>>
>
>
