+1 to looking at how we can leverage Nutch code. If there is an active community in Tika maintaining language profiles, then let's manage it in Tika (which was one of its original goals). Then Nutch can continue to be a consumer of such functionality and move to the Nutch2 delegator architecture we've been working towards in Nutch.

Cheers,
Chris

On Apr 14, 2011, at 7:16 AM, Oleg Tikhonov wrote:

> Sami,
> Chris and I did that some time ago for a developerWorks tutorial; the
> "clean" code exists, although it may be out of date.
> I was wondering, is it a good idea to use Nutch code inside Tika? Maybe the
> Nutch guys could extend it as an independent module?
>
> On Thu, Apr 14, 2011 at 3:01 PM, Sami Siren (JIRA) <[email protected]> wrote:
>
>> [https://issues.apache.org/jira/browse/TIKA-546?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13019793#comment-13019793]
>>
>> Sami Siren commented on TIKA-546:
>> ---------------------------------
>>
>> bq. Do we build the "LanguageProfilerBuilder" from Nutch code here locally
>> and ship it as binary package/library or as part of mvn install task/ ant
>> task?
>>
>> I would just do what Jan suggested = get the relevant source files from
>> Nutch, modify them as needed (like remove dependencies etc) and commit this
>> into Tika svn repository.
>>
>>> Add ability to create language profiles to tika-app
>>> ---------------------------------------------------
>>>
>>> Key: TIKA-546
>>> URL: https://issues.apache.org/jira/browse/TIKA-546
>>> Project: Tika
>>> Issue Type: New Feature
>>> Components: cli, languageidentifier
>>> Affects Versions: 0.7
>>> Reporter: Jan Høydahl
>>>
>>> Since TIKA-490 it is supposed to be easy adding new language profiles to
>>> TIKA. However, currently the process involves using Nutch's NGramProfile
>>> tool and editing the output.
>>> We should port Nutch's profile builder to Tika and make it part of
>>> tika-app.jar:
>>> # See http://wiki.apache.org/nutch/LanguageIdentifier
>>> # java -jar tika-app.jar --create-profile [--gramsizes=<n>,<n>,...]
>>> [--maxlines=<max>] <profile-name> <filename> <encoding>
>>> Using --gramsizes and --maxlines, we could support both Tika-style
>>> profiles and Nutch-style profiles and thus deprecate the Nutch tool.
>>> Defaults should be --gramsizes=3 --maxlines=1000
>>
>> --
>> This message is automatically generated by JIRA.
>> For more information on JIRA, see: http://www.atlassian.com/software/jira

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Senior Computer Scientist
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 171-266B, Mailstop: 171-246
Email: [email protected]
WWW: http://sunset.usc.edu/~mattmann/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Adjunct Assistant Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
