[Nutch-dev] Nutch and Lucene

2006-11-10 Thread hzhong

Hello,

This is what I want to do: given a document, find all its terms and
their frequencies.

I understand that Nutch is built on top of Lucene.  In Lucene, I can access
the terms of a document and their frequencies via the IndexReader.  However,
in Nutch, I am not sure if there's an equivalent.  In Lucene, the IndexReader
needs to know where the inverted indexes are.  In Nutch, I am not sure how
and where to locate the inverted indexes.

Is it possible to access the inverted index from Nutch?  

Thank you very much for your help.
-- 
View this message in context: 
http://www.nabble.com/Nutch-and-Lucene-tf2606327.html#a7272844
Sent from the Nutch - Dev mailing list archive at Nabble.com.


-
Using Tomcat but need to do more? Need to support web services, security?
Get stuff done quickly with pre-integrated technology to make your job easier
Download IBM WebSphere Application Server v.1.0.1 based on Apache Geronimo
http://sel.as-us.falkag.net/sel?cmd=lnkkid=120709bid=263057dat=121642
___
Nutch-developers mailing list
Nutch-developers@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/nutch-developers


Re: [Nutch-dev] Nutch and Lucene

2006-11-10 Thread Andrzej Bialecki
hzhong wrote:
 Hello,

 This is what I want to do.  Given a document, find all its terms and
 frequencies.  

 I understand that Nutch is built on top of Lucene.  In Lucene, I can access
 the terms and their frequencies of a document via the indexreader.  However,
 in nutch, I am not sure if there's an equivalent.  In Lucene, indexreader
 needs to know where the inverted indexes are.  In Nutch, I am not sure how
 and where to locate the inverted indexes.  

 Is it possible to access the inverted index from Nutch?  
   

What you need is called a term vector. Nutch doesn't support this out of
the box, but it's relatively easy to add. You would have to modify
org.apache.nutch.searcher.Searcher and add a method to retrieve a
TermVector, then implement this method in
org.apache.nutch.searcher.IndexSearcher using Lucene classes.
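
To make the term-vector idea concrete, here is a minimal, dependency-free
sketch of the data such a method would return: each distinct term of one
document with its frequency. It is not Nutch or Lucene code; the class name,
method name, and the naive tokenizer are assumptions for illustration. In
Lucene releases of that period the stored equivalent is, as far as I know,
read with IndexReader.getTermFreqVector(docNumber, field), and only for
fields that were indexed with term vectors enabled.

```java
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

// Illustrative only: a hand-rolled "term vector" for a single document.
// Hypothetical class; the tokenizer below is a simplistic assumption.
class TermVectorSketch {

    // Returns each distinct term of the text with its frequency, sorted by term.
    static Map<String, Integer> termFrequencies(String text) {
        Map<String, Integer> freqs = new TreeMap<>();
        // Assumption: lowercase, then split on anything that is not a letter or digit.
        for (String token : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!token.isEmpty()) {
                freqs.merge(token, 1, Integer::sum);
            }
        }
        return freqs;
    }

    public static void main(String[] args) {
        // Prints a map in which "lucene" has frequency 2 and every other term 1.
        System.out.println(termFrequencies("Nutch is built on Lucene; Lucene indexes terms."));
    }
}
```

A real implementation would delegate to the index rather than re-tokenize
the text, so that the terms match what the analyzer produced at index time.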

-- 
Best regards,
Andrzej Bialecki 
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com







[Nutch-dev] [jira] Commented: (NUTCH-395) Increase fetching speed

2006-11-10 Thread Sami Siren (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-395?page=comments#action_12448795 ]

Sami Siren commented on NUTCH-395:
--

Have you measured what made the biggest impact on performance - changes to
Metadata, or changes to IO in FetcherOutput?
Did not have time yet; I would guess that the IO changes make up the most
significant part.

After more digging, my initial guess might not have been correct. Without
touching IO at all, I get the same improvement from changing trunk, measured
against the nightly builds, as I reported before on the 0.8 branch.

This is good, because we don't need to change file formats at all.



 Increase fetching speed
 -----------------------

                 Key: NUTCH-395
                 URL: http://issues.apache.org/jira/browse/NUTCH-395
             Project: Nutch
          Issue Type: Improvement
          Components: fetcher
    Affects Versions: 0.8.1
            Reporter: Sami Siren
         Assigned To: Sami Siren
         Attachments: nutch-0.8-performance.txt


 There has been some discussion on the Nutch mailing lists about the fetcher
 being slow; this patch tries to address that. The patch is just a quick hack
 and needs some cleaning up. It also currently applies to the 0.8 branch and
 not trunk, and it has not been tested at large scale. What does it change?
 Metadata - the original Metadata uses spellchecking, the new version does not
 (a decorator is provided that can do it, and it should perhaps be used where
 HTTP headers are handled, but in most cases the functionality is not
 required).
 Reading/writing various data structures - the patch tries to do IO more
 efficiently; see the patch for details.
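
The Metadata change described above can be pictured as a plain store plus an
optional decorator. The sketch below is not the code from the patch: every
name in it (Meta, PlainMeta, SpellCheckingMeta, the list of known headers,
the distance threshold of 2) is hypothetical, and it assumes "spellchecking"
means correcting near-miss header names by edit distance.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical names throughout; this is not the code from the NUTCH-395 patch.
interface Meta {
    void set(String name, String value);
    String get(String name);
}

// Fast path: names are stored exactly as given, with no correction step.
class PlainMeta implements Meta {
    private final Map<String, String> values = new HashMap<>();
    public void set(String name, String value) { values.put(name, value); }
    public String get(String name) { return values.get(name); }
}

// Decorator: corrects near-miss names to a known header before delegating.
class SpellCheckingMeta implements Meta {
    private static final String[] KNOWN = {"Content-Type", "Content-Length", "Last-Modified"};
    private final Meta delegate;

    SpellCheckingMeta(Meta delegate) { this.delegate = delegate; }

    public void set(String name, String value) { delegate.set(normalize(name), value); }
    public String get(String name) { return delegate.get(normalize(name)); }

    // Maps a name to a known header if within edit distance 2, else returns it unchanged.
    static String normalize(String name) {
        String lower = name.toLowerCase(Locale.ROOT);
        for (String known : KNOWN) {
            if (editDistance(lower, known.toLowerCase(Locale.ROOT)) <= 2) return known;
        }
        return name;
    }

    // Classic Levenshtein distance using two rolling rows.
    static int editDistance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) prev[j] = j;
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev; prev = curr; curr = tmp;
        }
        return prev[b.length()];
    }
}
```

The point of the split is that only callers handling untrusted HTTP headers
pay for the edit-distance check, by wrapping with
new SpellCheckingMeta(new PlainMeta()); everything else uses PlainMeta.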
 Initial benchmark:
 A small benchmark was done to measure the performance of the changes with a
 script that basically does the following:
 - inject a list of urls into a fresh crawldb
 - create fetchlist (10k urls pointing to local filesystem)
 - fetch
 - updatedb
 original code from 0.8-branch:
 real    10m51.907s
 user    10m9.914s
 sys     0m21.285s
 after applying the patch:
 real    4m15.313s
 user    3m42.598s
 sys     0m18.485s
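
For readers who want the benchmark above as a single ratio, the following
sketch converts the two "real" figures to seconds and computes the speedup.
The class and method names are hypothetical; only the two durations come
from the report.

```java
import java.util.Locale;

// Hypothetical helper for reading the "real" lines of the benchmark above.
class BenchmarkSpeedup {

    // Converts a duration of the form "XmY.ZZZs" to seconds.
    static double toSeconds(String t) {
        int m = t.indexOf('m');
        double minutes = Double.parseDouble(t.substring(0, m));
        double seconds = Double.parseDouble(t.substring(m + 1, t.length() - 1));
        return minutes * 60 + seconds;
    }

    public static void main(String[] args) {
        double before = toSeconds("10m51.907s"); // original 0.8-branch code
        double after = toSeconds("4m15.313s");   // after applying the patch
        System.out.println(String.format(Locale.ROOT,
                "speedup: %.2fx, wall-clock time saved: %.1f%%",
                before / after, 100 * (1 - after / before)));
    }
}
```

With the figures above this works out to roughly a 2.55x speedup, or about
61% less wall-clock time, for this one small local-filesystem run.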

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: 
http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira





Re: [Nutch-dev] implement thai lanaguage analyzer in nutch

2006-11-10 Thread Teruhiko Kurosaka
Oh, Thai words are not space delimited?
OK, in that case, you'd need to study how ThaiAnalyzer works and
then modify the rules in NutchAnalysis.jj (if you are going to use
the web search GUI from Nutch).  This is because search expressions
are first parsed by the parser generated from NutchAnalysis.jj,
before each term is handed to the language-specific analyzer,
and currently, if a character belongs to the CJK category, each character
is treated as though it were a word.  If ThaiAnalyzer does not do the
same, you can index the Thai docs, but you won't be able to find any doc
unless the search term is one Unicode character.
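
The mismatch described above can be shown with a small, dependency-free
sketch. It is neither the NutchAnalysis.jj grammar nor ThaiAnalyzer: the
class and method names are hypothetical, the query side is assumed to emit
one term per Thai character (as the CJK rule would if extended to Thai), and
the index side is assumed to hold two whole-word terms.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Illustration only; not the NutchAnalysis.jj grammar or ThaiAnalyzer.
class ThaiTokenMismatch {

    // Query side (assumed behaviour): every Thai character becomes its own term.
    static List<String> perCharacterTerms(String query) {
        List<String> terms = new ArrayList<>();
        for (int i = 0; i < query.length(); i++) {
            char c = query.charAt(i);
            if (Character.UnicodeBlock.of(c) == Character.UnicodeBlock.THAI) {
                terms.add(String.valueOf(c));
            }
        }
        return terms;
    }

    // True if at least one query term exists among the indexed terms.
    static boolean anyMatch(List<String> queryTerms, Set<String> indexedTerms) {
        for (String term : queryTerms) {
            if (indexedTerms.contains(term)) return true;
        }
        return false;
    }

    public static void main(String[] args) {
        // Index side (assumed): a word-level analyzer produced two whole-word terms.
        String wordOne = "\u0E20\u0E32\u0E29\u0E32";
        String wordTwo = "\u0E44\u0E17\u0E22";
        Set<String> indexed = new HashSet<>(Arrays.asList(wordOne, wordTwo));

        // Per-character query terms never equal a multi-character indexed term.
        System.out.println(anyMatch(perCharacterTerms(wordOne + wordTwo), indexed)); // false
        // A query split the same way as the index does match.
        System.out.println(anyMatch(Arrays.asList(wordTwo), indexed)); // true
    }
}
```

Either side can be changed to fix this; what matters is that query parsing
and index-time analysis split Thai text into the same units.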


-kuro

 -Original Message-
 From: sanjeev [mailto:[EMAIL PROTECTED]
 Sent: 2006-11-08 19:28
 To: nutch-dev@lucene.apache.org
 Subject: Re: implement thai lanaguage analyzer in nutch

 I need a Thai Analyzer for Nutch. I want the crawler to be intelligent
 enough to split Thai words correctly, since Thai doesn't have spaces
 between words.
 :-(

 ogjunk-nutch wrote:

  Regarding Thai, there is a Thai Analyzer in Lucene already:

  $ ll contrib/analyzers/src/java/org/apache/lucene/analysis/th/
  total 24
  drwxrwxr-x  7 otis otis 4096 Oct 27 02:08 .svn/
  -rw-rw-r--  1 otis otis 1528 Jun  5 14:27 ThaiAnalyzer.java
  -rw-rw-r--  1 otis otis 2437 Jun  5 14:27 ThaiWordFilter.java

  Otis

  - Original Message -
  From: Teruhiko Kurosaka [EMAIL PROTECTED]
  To: sanjeev [EMAIL PROTECTED]; nutch-dev@lucene.apache.org
  Sent: Wednesday, November 8, 2006 2:16:38 PM
  Subject: RE: implement thai lanaguage analyzer in nutch

  Sanjay,
  I don't think you should follow the Chinese example and extend the CJK
  range.  This was needed because Chinese and Japanese don't use spaces to
  separate words.  I believe Thai uses spaces, right?  If so, you should
  extend the LETTER range to include Thai characters rather than CJK.

  Another place you would need to change is the LanguageIdentifier.
  You would either train it, or implement some hack, in order for it to
  be able to detect Thai-language documents that are not HTML with a
  lang=th attribute.

  -kuro

 --
 View this message in context:
 http://www.nabble.com/implement-thai-lanaguage-analyzer-in-nutch-tf2587282.html#a7251826
 Sent from the Nutch - Dev mailing list archive at Nabble.com.





