[Nutch-dev] Nutch and Lucene
Hello,

This is what I want to do: given a document, find all of its terms and their frequencies.

I understand that Nutch is built on top of Lucene. In Lucene, I can access a document's terms and their frequencies through the IndexReader. In Nutch, however, I am not sure whether there is an equivalent. Lucene's IndexReader needs to know where the inverted index is stored, and I do not know how or where to locate the inverted index in Nutch.

Is it possible to access the inverted index from Nutch?

Thank you very much for your help.

--
View this message in context: http://www.nabble.com/Nutch-and-Lucene-tf2606327.html#a7272844
Sent from the Nutch - Dev mailing list archive at Nabble.com.

-
Using Tomcat but need to do more? Need to support web services, security? Get stuff done quickly with pre-integrated technology to make your job easier. Download IBM WebSphere Application Server v.1.0.1 based on Apache Geronimo
http://sel.as-us.falkag.net/sel?cmd=lnkkid=120709bid=263057dat=121642
___
Nutch-developers mailing list
Nutch-developers@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/nutch-developers
Re: [Nutch-dev] Nutch and Lucene
hzhong wrote:
> Hello, This is what I want to do. Given a document, find all its terms and
> frequencies. I understand that Nutch is built on top of Lucene. In Lucene, I
> can access the terms and their frequencies of a document via the indexreader.
> However, in nutch, I am not sure if there's an equivalent. In Lucene,
> indexreader needs to know where the inverted indexes are. In Nutch, I am not
> sure how and where to locate the inverted indexes. Is it possible to access
> the inverted index from Nutch?

What you need is called a term vector. Nutch doesn't support this out of the box, but it's relatively easy to add. You would have to modify org.apache.nutch.searcher.Searcher to add a method that retrieves a term vector, and then implement that method in org.apache.nutch.searcher.IndexSearcher using Lucene classes.

--
Best regards,
Andrzej Bialecki
Information Retrieval, Semantic Web
Embedded Unix, System Integration
http://www.sigram.com Contact: info at sigram dot com
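For readers unfamiliar with the term, a term vector is simply the per-document list of terms with their frequencies. The sketch below computes one from raw text in plain Java, only to show the shape of the data. It is not Nutch or Lucene code, and the class name TermVectorSketch and its naive tokenization are assumptions made for illustration. In Lucene releases from around the time of this thread, IndexReader.getTermFreqVector(docNumber, field) returned the stored equivalent, provided the field had been indexed with term vectors enabled. Whether a given Nutch index stores term vectors is not stated in the thread.

```java
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

/** Toy illustration of a per-document term vector (term -> frequency). Not Nutch or Lucene code. */
final class TermVectorSketch {

    /** Lower-cases the text, splits it on runs of non-letter/non-digit characters, and counts each term. */
    static Map<String, Integer> termFrequencies(String text) {
        Map<String, Integer> freqs = new TreeMap<>(); // TreeMap keeps the terms sorted
        for (String term : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{Nd}]+")) {
            if (!term.isEmpty()) { // split() can yield an empty leading element
                freqs.merge(term, 1, Integer::sum);
            }
        }
        return freqs;
    }

    public static void main(String[] args) {
        Map<String, Integer> vector =
                termFrequencies("Nutch is built on Lucene; Lucene builds the index.");
        // One line per term, tab-separated from its frequency.
        vector.forEach((term, freq) -> System.out.println(term + "\t" + freq));
    }
}
```

A real analyzer would also apply stop words, stemming and field-specific rules, so counts taken from an index can differ from this naive count.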
[Nutch-dev] [jira] Commented: (NUTCH-395) Increase fetching speed
[ http://issues.apache.org/jira/browse/NUTCH-395?page=comments#action_12448795 ]

Sami Siren commented on NUTCH-395:
----------------------------------

> have you measured what made the biggest impact on performance - changes to Metadata, or changes to IO in FetcherOutput?

did not have time yet, I would guess that IO changes make the most significant part.

After more digging, my initial guess might not have been correct. Without touching IO at all, I am able to get the same improvement by changing trunk (compared with the nightly builds) as I reported before on the 0.8 branch. This is good, because we don't need to change file formats at all.

Increase fetching speed
-----------------------

Key: NUTCH-395
URL: http://issues.apache.org/jira/browse/NUTCH-395
Project: Nutch
Issue Type: Improvement
Components: fetcher
Affects Versions: 0.8.1
Reporter: Sami Siren
Assigned To: Sami Siren
Attachments: nutch-0.8-performance.txt

There has been some discussion on the Nutch mailing lists about the fetcher being slow; this patch tries to address that. The patch is just a quick hack and needs some cleaning up. It also currently applies to the 0.8 branch and not trunk, and it has not been tested at large scale.

What does it change?

- Metadata: the original metadata uses spellchecking, the new version does not (a decorator is provided that can do it, and it should perhaps be used where HTTP headers are handled, but in most cases the functionality is not required).
- Reading/writing various data structures: the patch tries to do IO more efficiently; see the patch for details.

Initial benchmark:

A small benchmark was done to measure the performance of the changes, with a script that basically does the following:

- inject a list of urls into a fresh crawldb
- create fetchlist (10k urls pointing to local filesystem)
- fetch
- updatedb

original code from 0.8-branch:

real 10m51.907s
user 10m9.914s
sys 0m21.285s

after applying the patch:

real 4m15.313s
user 3m42.598s
sys 0m18.485s

--
This message is automatically generated by JIRA.
- If you think it was sent incorrectly contact one of the administrators: http://issues.apache.org/jira/secure/Administrators.jspa
- For more information on JIRA, see: http://www.atlassian.com/software/jira
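The Metadata change described in the issue (a fast plain store, plus an optional decorator that tolerates misspelled names where HTTP headers are handled) can be pictured with the sketch below. Every type here (SimpleMetadata, PlainMetadata, NormalizingMetadata) is hypothetical and is not taken from the patch. The normalization is also an assumption: it only ignores case and punctuation, which is much simpler than real spellchecking.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

/** Entry point for the sketch; every type in this block is hypothetical, not Nutch code. */
final class MetadataDecoratorSketch {
    public static void main(String[] args) {
        SimpleMetadata plain = new PlainMetadata();
        plain.set("content-type", "text/html");
        System.out.println(plain.get("Content-Type"));    // exact-name lookup, so this misses

        SimpleMetadata tolerant = new NormalizingMetadata(new PlainMetadata());
        tolerant.set("content-type", "text/html");
        System.out.println(tolerant.get("Content_Type")); // both names map to "Content-Type"
    }
}

/** Minimal metadata contract used by both implementations. */
interface SimpleMetadata {
    void set(String name, String value);
    String get(String name);
}

/** Fast path: names are stored and looked up exactly as given. */
final class PlainMetadata implements SimpleMetadata {
    private final Map<String, String> values = new HashMap<>();
    @Override public void set(String name, String value) { values.put(name, value); }
    @Override public String get(String name) { return values.get(name); }
}

/** Decorator: maps sloppy header names to a canonical spelling, then delegates. */
final class NormalizingMetadata implements SimpleMetadata {
    private static final Map<String, String> CANONICAL = new HashMap<>();
    static {
        for (String name : new String[] {"Content-Type", "Content-Length", "Last-Modified"}) {
            CANONICAL.put(squash(name), name);
        }
    }

    private final SimpleMetadata delegate;

    NormalizingMetadata(SimpleMetadata delegate) { this.delegate = delegate; }

    /** Lower-cases the name and drops everything except a-z, e.g. "Content_Type" -> "contenttype". */
    private static String squash(String name) {
        return name.toLowerCase(Locale.ROOT).replaceAll("[^a-z]", "");
    }

    /** Returns the canonical spelling if the name is known, otherwise the name unchanged. */
    static String normalize(String name) {
        return CANONICAL.getOrDefault(squash(name), name);
    }

    @Override public void set(String name, String value) { delegate.set(normalize(name), value); }
    @Override public String get(String name) { return delegate.get(normalize(name)); }
}
```

The point of the split is that callers who never see sloppy names pay nothing for normalization, while header-handling code can opt in by wrapping the plain store.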
Re: [Nutch-dev] implement thai lanaguage analyzer in nutch
Oh, Thai words are not space delimited? OK, in that case, you'd need to study how ThaiAnalyzer works and then modify the rules in NutchAnalysis.jj (if you are going to use the web search GUI from Nutch). This is because search expressions are first parsed by the parser generated from NutchAnalysis.jj, before each term is handed to the language-specific analyzer, and currently, if a character belongs to the CJK category, each character is treated as though it were a word. If ThaiAnalyzer does not do the same, you can index the Thai docs, but you won't be able to find any doc unless the search term is a single Unicode character.

-kuro

-----Original Message-----
From: sanjeev [mailto:[EMAIL PROTECTED]
Sent: 2006-11-08 19:28
To: nutch-dev@lucene.apache.org
Subject: Re: implement thai lanaguage analyzer in nutch

I need a Thai Analyzer for Nutch. I want the crawler to be intelligent enough to split Thai words correctly, since Thai doesn't have spaces between words. :-(

ogjunk-nutch wrote:

Regarding Thai, there is a Thai Analyzer in Lucene already:

$ ll contrib/analyzers/src/java/org/apache/lucene/analysis/th/
total 24
drwxrwxr-x 7 otis otis 4096 Oct 27 02:08 .svn/
-rw-rw-r-- 1 otis otis 1528 Jun 5 14:27 ThaiAnalyzer.java
-rw-rw-r-- 1 otis otis 2437 Jun 5 14:27 ThaiWordFilter.java

Otis

----- Original Message ----
From: Teruhiko Kurosaka [EMAIL PROTECTED]
To: sanjeev [EMAIL PROTECTED]; nutch-dev@lucene.apache.org
Sent: Wednesday, November 8, 2006 2:16:38 PM
Subject: RE: implement thai lanaguage analyzer in nutch

Sanjay, I don't think you should follow the Chinese example and extend the CJK range. That was needed because Chinese and Japanese don't use spaces to separate words. I believe Thai uses spaces, right? If so, you should extend the LETTER range to include Thai characters rather than CJK.

Another place you would need to change is the LanguageIdentifier. You would either train it, or implement some hack, so that it can detect Thai-language documents that are not HTML with a lang=th attribute.

-kuro

--
View this message in context: http://www.nabble.com/implement-thai-lanaguage-analyzer-in-nutch-tf2587282.html#a7251826
Sent from the Nutch - Dev mailing list archive at Nabble.com.
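The per-character rule kuro describes can be illustrated with a toy tokenizer. This is a minimal sketch in plain Java, not the grammar in NutchAnalysis.jj; the class name PerCharTokenizerSketch and the choice of Han, Hiragana and Katakana as the "CJK category" are assumptions. It shows that characters in that category become one token each, while Thai text treated as ordinary letters comes out as a single unsegmented run, so query-time and index-time tokenization have to agree for a search to match.

```java
import java.lang.Character.UnicodeScript;
import java.util.ArrayList;
import java.util.List;

/** Toy tokenizer illustrating per-character CJK tokens; not the NutchAnalysis.jj grammar. */
final class PerCharTokenizerSketch {

    /** Scripts that this sketch emits one character at a time (an assumed stand-in for "CJK"). */
    static boolean isSingleCharToken(int codePoint) {
        UnicodeScript script = UnicodeScript.of(codePoint);
        return script == UnicodeScript.HAN
                || script == UnicodeScript.HIRAGANA
                || script == UnicodeScript.KATAKANA;
    }

    /** Letters, digits and combining marks are grouped into runs; everything else separates tokens. */
    static boolean isRunChar(int codePoint) {
        int type = Character.getType(codePoint);
        return Character.isLetterOrDigit(codePoint)
                || type == Character.NON_SPACING_MARK
                || type == Character.COMBINING_SPACING_MARK;
    }

    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        StringBuilder run = new StringBuilder();
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            i += Character.charCount(cp);
            if (isSingleCharToken(cp)) {
                flush(run, tokens);
                tokens.add(new String(Character.toChars(cp))); // each CJK character is its own token
            } else if (isRunChar(cp)) {
                run.appendCodePoint(cp);
            } else {
                flush(run, tokens); // whitespace or punctuation ends the current run
            }
        }
        flush(run, tokens);
        return tokens;
    }

    private static void flush(StringBuilder run, List<String> tokens) {
        if (run.length() > 0) {
            tokens.add(run.toString());
            run.setLength(0);
        }
    }

    public static void main(String[] args) {
        // "search" + two Han characters + "engine"
        System.out.println(tokenize("search \u691c\u7d22 engine").size());
        // Seven Thai letters with no spaces between them
        System.out.println(tokenize("\u0e20\u0e32\u0e29\u0e32\u0e44\u0e17\u0e22").size());
    }
}
```

If the index side segments Thai into words while the query side produces one long run (or single characters), the query terms will not line up with the indexed terms, which is the mismatch the reply warns about.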