[JIRA] Updated: (NXSEM-7) develop hadoop based toolset to build categorized TF-IDF corpora to train document classification models

Olivier Grisel (JIRA NUXEO) Tue, 31 May 2011 05:44:05 -0700

     [ https://jira.nuxeo.com/browse/NXSEM-7?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]


Olivier Grisel updated NXSEM-7:
-------------------------------

       Priority: Major  (was: Minor)
    Description: 
The toolset should be packaged to be easily deployable on AWS using the 
Cloudera Distribution for Hadoop 2 AMI [1] and the Wikipedia XML dump AWS 
Dataset [2].

The Mahout project already has a partial implementation of this (lacking the 
TF-IDF [3] part). To avoid loading a huge dictionary in memory, we plan to 
use a hashed representation [4] of the term and document frequencies.

[1] http://archive.cloudera.com/docs/ec2.html
[2] http://aws.typepad.com/aws/2009/09/new-public-data-set-wikipedia-xml-data.html
[3] http://en.wikipedia.org/wiki/Tf-idf
[4] http://hunch.net/~jl/projects/hash_reps/index.html
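A minimal sketch of what a hashed representation of term and document
frequencies could look like, for illustration only. It is not the code planned
for this task: the bucket count, the CRC32 hash, the tokenizer and all function
names are assumptions. Each term is mapped to a bucket index by a stable hash,
so no term dictionary has to be held in memory.

```python
import math
import re
import zlib
from collections import Counter

N_BUCKETS = 2 ** 18  # assumed size of the hashed feature space


def bucket(term, n_buckets=N_BUCKETS):
    """Map a term to a bucket index with a stable hash (CRC32, an assumption)."""
    return zlib.crc32(term.encode("utf-8")) % n_buckets


def hashed_tf(text, n_buckets=N_BUCKETS):
    """Term frequencies of one document, keyed by bucket index, not by term."""
    tokens = re.findall(r"\w+", text.lower())
    return Counter(bucket(t, n_buckets) for t in tokens)


def hashed_tfidf(docs, n_buckets=N_BUCKETS):
    """Return one sparse {bucket: tf * idf} dict per document."""
    tfs = [hashed_tf(d, n_buckets) for d in docs]
    df = Counter()
    for tf in tfs:
        df.update(tf.keys())  # each bucket is counted once per document
    n_docs = len(docs)
    return [
        {b: count * math.log(n_docs / df[b]) for b, count in tf.items()}
        for tf in tfs
    ]


vectors = hashed_tfidf(["hadoop mahout hadoop", "hadoop solr"])
```

Distinct terms may collide in the same bucket. A larger bucket count lowers
the collision rate at the cost of a larger index space.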

Edit: the in-memory hashed representation has been replaced by a sparse,
persisted representation using Solr / Lucene. The TF-IDF k-Nearest Neighbors
lookups will be implemented using MoreLikeThis queries instead. See the
comments for details.
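As a rough illustration of a k-Nearest Neighbors lookup through MoreLikeThis,
the sketch below only builds the request URL for Solr's MoreLikeThis handler.
The base URL, the "/mlt" handler path (it has to be registered in
solrconfig.xml), the "text" field name and the document id are all
hypothetical; the parameters used (q, mlt.fl, mlt.mintf, mlt.mindf, rows, fl,
wt) are standard Solr request parameters.

```python
from urllib.parse import urlencode


def mlt_query_url(base_url, doc_id, k=10, field="text"):
    """Build a MoreLikeThis request for the k documents most similar to doc_id.

    base_url, the /mlt path and the field name are assumptions for this sketch.
    """
    params = {
        "q": f"id:{doc_id}",   # the already-indexed document to find neighbors of
        "mlt.fl": field,       # field whose terms drive the similarity
        "mlt.mintf": 1,        # minimum term frequency in the source document
        "mlt.mindf": 1,        # minimum document frequency in the index
        "rows": k,             # number of neighbors to return
        "fl": "id,score",      # only return ids and similarity scores
        "wt": "json",
    }
    return f"{base_url}/mlt?{urlencode(params)}"


url = mlt_query_url("http://localhost:8983/solr", "doc42", k=5)
```

The categories of the returned neighbors could then be tallied to classify the
query document, which is one common way to turn a k-NN lookup into a
classifier.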


  was:
The toolset should be packaged to be easily deployable on AWS using the 
Cloudera Distribution for Hadoop 2 AMI [1] and the wikipedia XML dump AWS 
Dataset [2].

The Mahout project already has some partial implementation of this (lacking the 
TF-IDF [3] part). To avoid having to load a huge dictionary in memory, we plan 
to leverage a hashed representation [4] of the term and document frequencies.

[1] http://archive.cloudera.com/docs/ec2.html
[2] http://aws.typepad.com/aws/2009/09/new-public-data-set-wikipedia-xml-data.html
[3] http://en.wikipedia.org/wiki/Tf-idf
[4] http://hunch.net/~jl/projects/hash_reps/index.html



> develop hadoop based toolset to build categorized TF-IDF corpora to train 
> document classification models
> --------------------------------------------------------------------------------------------------------
>
>                 Key: NXSEM-7
>                 URL: https://jira.nuxeo.com/browse/NXSEM-7
>             Project: Nuxeo Semantic R&D
>          Issue Type: Task
>            Reporter: Olivier Grisel
>            Assignee: Olivier Grisel
>            Priority: Major
>             Fix For: 5.4.2
>
>
> The toolset should be packaged to be easily deployable on AWS using the 
> Cloudera Distribution for Hadoop 2 AMI [1] and the Wikipedia XML dump AWS 
> Dataset [2].
> The Mahout project already has a partial implementation of this (lacking 
> the TF-IDF [3] part). To avoid loading a huge dictionary in memory, we plan 
> to use a hashed representation [4] of the term and document frequencies.
> [1] http://archive.cloudera.com/docs/ec2.html
> [2] http://aws.typepad.com/aws/2009/09/new-public-data-set-wikipedia-xml-data.html
> [3] http://en.wikipedia.org/wiki/Tf-idf
> [4] http://hunch.net/~jl/projects/hash_reps/index.html
> Edit: the in-memory hashed representation has been replaced by a sparse,
> persisted representation using Solr / Lucene. The TF-IDF k-Nearest Neighbors
> lookups will be implemented using MoreLikeThis queries instead. See the
> comments for details.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
_______________________________________________
ECM-tickets mailing list
[email protected]
http://lists.nuxeo.com/mailman/listinfo/ecm-tickets
