[jira] Created: (NUTCH-441) Thai Analyzer Plugin

2007-02-07 Thread Vee Satayamas (JIRA)
Thai Analyzer Plugin


 Key: NUTCH-441
 URL: https://issues.apache.org/jira/browse/NUTCH-441
 Project: Nutch
  Issue Type: New Feature
  Components: indexer
Reporter: Vee Satayamas


This Thai analyzer plugin was created by copying and modifying the French 
analyzer plugin. However, there is no Thai analyzer in 
lucene-analyzers-2.0.0.jar (in lib-lucene-analyzers), so 
lucene-analyzers-nightly.jar was used instead. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (NUTCH-441) Thai Analyzer Plugin

2007-02-07 Thread Vee Satayamas (JIRA)

 [ https://issues.apache.org/jira/browse/NUTCH-441?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Vee Satayamas updated NUTCH-441:


Attachment: nutch-plugin-analysis-th-20070207.patch.gz

Thai Analyzer (lib-lucene-analyzers modification is not included in the patch)

 Thai Analyzer Plugin
 

 Key: NUTCH-441
 URL: https://issues.apache.org/jira/browse/NUTCH-441
 Project: Nutch
  Issue Type: New Feature
  Components: indexer
Reporter: Vee Satayamas
 Attachments: nutch-plugin-analysis-th-20070207.patch.gz


 This Thai analyzer plugin was created by copying and modifying the French 
 analyzer plugin. However, there is no Thai analyzer in 
 lucene-analyzers-2.0.0.jar (in lib-lucene-analyzers), so 
 lucene-analyzers-nightly.jar was used instead. 




Re: RSS-fecter and index individul-how can i realize this function

2007-02-07 Thread Doug Cutting

Renaud Richardet wrote:
I see. I was thinking that I could index the feed items without having 
to fetch them individually.


Okay, so if Parser#parse returned a Map<String,Parse>, then the URL for 
each parse should be that of its link, since you don't want to fetch 
that separately.  Right?


So now the question is, how much impact would this change to the Parser 
API have on the rest of Nutch?  It would require changes to all Parser 
implementations, to ParseSegment, to ParseUtil, and to Fetcher.  But, 
as far as I can tell, most of these changes look straightforward.


Doug


How can Nutch be used to build a vertical search engine?

2007-02-07 Thread ahmed ghouzia
I am trying to build a vertical search engine using a rule-based crawling 
strategy.
I have finished this part as a web application, and I want to combine it with 
Nutch to control and select the URLs to be fetched. Are there any ideas on how 
to do that?

 

[jira] Created: (NUTCH-442) Integrate Solr/Nutch

2007-02-07 Thread rubdabadub (JIRA)
Integrate Solr/Nutch


 Key: NUTCH-442
 URL: https://issues.apache.org/jira/browse/NUTCH-442
 Project: Nutch
  Issue Type: New Feature
 Environment: Ubuntu linux
Reporter: rubdabadub


Hi:

After trying out Sami's patch for Solr/Nutch integration, which can be found here 
(http://blog.foofactory.fi/2007/02/online-indexing-integrating-nutch-with.html), 
I can confirm that it worked :-) That led me to request the following:

I would be very grateful if this could be included in Nutch 0.9, as I am 
trying to eliminate my Python-based crawler, which posts documents to Solr. As I 
am in a corporate environment, I can't install the trunk version in production, 
so I am asking for this to be included in the 0.9 release. I hope my wish 
will be granted.

I look forward to some feedback.

Thank you.






Re: RSS-fecter and index individul-how can i realize this function

2007-02-07 Thread Chris Mattmann
Guys,

 Sorry to be so thick-headed, but could someone explain to me in really
simple language what this change is requesting that is different from the
current Nutch API? I still don't get it, sorry...

Cheers,
  Chris



On 2/7/07 9:58 AM, Doug Cutting [EMAIL PROTECTED] wrote:

 Renaud Richardet wrote:
 I see. I was thinking that I could index the feed items without having
 to fetch them individually.
 
 Okay, so if Parser#parse returned a Map<String,Parse>, then the URL for
 each parse should be that of its link, since you don't want to fetch
 that separately.  Right?
 
 So now the question is, how much impact would this change to the Parser
 API have on the rest of Nutch?  It would require changes to all Parser
 implementations, to ParseSegment, to ParseUtil, and to Fetcher.  But,
 as far as I can tell, most of these changes look straightforward.
 
 Doug

__
Chris A. Mattmann
[EMAIL PROTECTED]
Staff Member
Modeling and Data Management Systems Section (387)
Data Management Systems and Technologies Group

_
Jet Propulsion Laboratory    Pasadena, CA
Office: 171-266B             Mailstop: 171-246
___

Disclaimer:  The opinions presented within are my own and do not reflect
those of either NASA, JPL, or the California Institute of Technology.




[jira] Created: (NUTCH-443) allow parsers to return multiple Parse object, this will speed up the rss parser

2007-02-07 Thread Renaud Richardet (JIRA)
allow parsers to return multiple Parse object, this will speed up the rss parser


 Key: NUTCH-443
 URL: https://issues.apache.org/jira/browse/NUTCH-443
 Project: Nutch
  Issue Type: New Feature
  Components: fetcher
Affects Versions: 0.9.0
Reporter: Renaud Richardet
Priority: Minor
 Fix For: 0.9.0


Allow Parser#parse to return a Map<String,Parse>. This way, the RSS parser can 
return multiple Parse objects, which will all be indexed separately. Advantage: 
no need to fetch all feed items separately.
See the discussion at 
http://www.nabble.com/RSS-fecter-and-index-individul-how-can-i-realize-this-function-tf3146271.html





Re: RSS-fecter and index individul-how can i realize this function

2007-02-07 Thread Renaud Richardet

Doug Cutting wrote:

Renaud Richardet wrote:
I see. I was thinking that I could index the feed items without 
having to fetch them individually.


Okay, so if Parser#parse returned a Map<String,Parse>, then the URL 
for each parse should be that of its link, since you don't want to 
fetch that separately. Right?

Exactly.


So now the question is, how much impact would this change to the 
Parser API have on the rest of Nutch? It would require changes to all 
Parser implementations, to ParseSegment, to ParseUtil, and to 
Fetcher. But, as far as I can tell, most of these changes look 
straightforward.

I think so, too. I have opened an issue in JIRA 
(https://issues.apache.org/jira/browse/NUTCH-443) and will give it a try.

Doğacan, have you started working on it yet?

Thanks,
Renaud



Re: RSS-fecter and index individul-how can i realize this function

2007-02-07 Thread Doug Cutting

Chris Mattmann wrote:

 Sorry to be so thick-headed, but could someone explain to me in really
simple language what this change is requesting that is different from the
current Nutch API? I still don't get it, sorry...


A Content would no longer generate a single Parse.  Instead, a Content 
could potentially generate many Parses.  For most types of content, 
e.g., HTML, each Content would still generate a single Parse.  But for 
RSS, a Content might generate multiple Parses, each indexed separately 
and each with a distinct URL.
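
As an illustration only, the idea could be sketched in Java like this. Parse 
and FeedItem below are simplified, hypothetical stand-ins rather than the real 
Nutch classes, and the method names are invented for the example; the actual 
API change may look different.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class MultiParseSketch {

    // Hypothetical stand-in for Nutch's Parse; only the extracted text is modeled.
    public static final class Parse {
        public final String text;
        public Parse(String text) { this.text = text; }
    }

    // Hypothetical feed item: the item's link plus its description text.
    public static final class FeedItem {
        public final String link;
        public final String description;
        public FeedItem(String link, String description) {
            this.link = link;
            this.description = description;
        }
    }

    // One fetched feed yields one Parse per item, keyed by the item's own link,
    // so the items can be indexed without being fetched individually.
    public static Map<String, Parse> parseFeed(List<FeedItem> items) {
        Map<String, Parse> parses = new LinkedHashMap<>();
        for (FeedItem item : items) {
            parses.put(item.link, new Parse(item.description));
        }
        return parses;
    }

    // Ordinary content such as HTML still yields a single Parse under its own URL.
    public static Map<String, Parse> parseHtml(String url, String text) {
        Map<String, Parse> parses = new LinkedHashMap<>();
        parses.put(url, new Parse(text));
        return parses;
    }
}
```

Callers would then loop over the map entries instead of handling a single 
Parse, which is why the callers of the parser would need to change as well.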


Another potential application could be processing archives: the parser 
could unpack the archive, and each item in it could be indexed separately 
rather than indexing the archive as a whole.  This only makes sense if each 
item has a distinct URL, which it does in RSS, but it might not in an 
archive.  However, some archive file formats do contain URLs, such as the 
one used by the Internet Archive.


http://www.archive.org/web/researcher/ArcFileFormat.php

Does that help?

Doug


NPE while fetching

2007-02-07 Thread Gal Nitzan
Hi,

I am getting an NPE while fetching. I use Nutch trunk (from a week ago) with
Hadoop 0.11.1.


java.lang.NullPointerException
    at org.apache.hadoop.io.SequenceFile$Sorter$MergeQueue.merge(SequenceFile.java:2392)
    at org.apache.hadoop.io.SequenceFile$Sorter.merge(SequenceFile.java:2087)
    at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.mergeParts(MapTask.java:498)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:191)
    at org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:1372)


Any pointers to the cause?

Thanks,

Gal.




Re: NPE while fetching

2007-02-07 Thread Sean Dean
This was corrected in Hadoop as per issue HADOOP-917, but I'm thinking some 
code in Nutch might have to be changed also. I reported this issue (via mailing 
list) a while ago and I'm glad it was fixed, but I have been purposely staying 
with revision 495214 of trunk which seems to provide the best 
stability/performance.

 
- Original Message 
From: Gal Nitzan [EMAIL PROTECTED]
To: nutch-dev@lucene.apache.org
Sent: Wednesday, February 7, 2007 6:36:19 PM
Subject: NPE while fetching


Hi,

I am getting an NPE while fetching. I use Nutch trunk (from a week ago) with
Hadoop 0.11.1.


java.lang.NullPointerException
    at org.apache.hadoop.io.SequenceFile$Sorter$MergeQueue.merge(SequenceFile.java:2392)
    at org.apache.hadoop.io.SequenceFile$Sorter.merge(SequenceFile.java:2087)
    at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.mergeParts(MapTask.java:498)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:191)
    at org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:1372)


Any pointers to the cause?

Thanks,

Gal.

Re: RSS-fecter and index individul-how can i realize this function

2007-02-07 Thread Sami Siren
 Also true.  On the other hand, Nutch provides 98% of an RSS search
 engine.  It'd be a shame to have to re-invent everything else and it
 would be great if Nutch could evolve to support RSS well.
 
 Could image search also benefit from this?  One could generate a
 Parse for each image on a page whose text was from the page.  Product
 search too, perhaps.

These are excellent points. I am totally +1 for the API change; it opens
doors for a lot of new applications.

--
 Sami Siren


[jira] Updated: (NUTCH-439) Top Level Domains Indexing / Scoring

2007-02-07 Thread Enis Soztutar (JIRA)

 [ https://issues.apache.org/jira/browse/NUTCH-439?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Enis Soztutar updated NUTCH-439:


Attachment: tld_plugin_v1.1.patch

I accidentally forgot to unset http.agent.name in v1.0. This version is the 
same except that the agent name is not set. This patch obsoletes v1.0. 


 Top Level Domains Indexing / Scoring
 

 Key: NUTCH-439
 URL: https://issues.apache.org/jira/browse/NUTCH-439
 Project: Nutch
  Issue Type: New Feature
  Components: indexer
Affects Versions: 0.9.0
Reporter: Enis Soztutar
 Attachments: tld_plugin_v1.0.patch, tld_plugin_v1.1.patch


 Top-level domains (TLDs) are the last part(s) of the host name in the DNS. 
 TLDs are managed by the Internet Assigned Numbers Authority (IANA). IANA 
 divides TLDs into three groups: infrastructure, generic (such as com, edu), 
 and country-code TLDs (such as de, tr). Indexing the top-level domain, 
 and optionally boosting on it, is needed to improve search results and 
 enhance locality. 
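
 A minimal Java sketch of the simplest case described above: taking the last 
 dot-separated label of a host name. The class and method names are 
 hypothetical and are not taken from the attached patch. Suffixes made of 
 several labels (the "part(s)" case) would need a suffix list and are not 
 handled here.

```java
import java.util.Locale;

public class TldSketch {

    // Returns the last dot-separated label of a host name in lower case,
    // or an empty string when the host name contains no dot.
    // A single trailing dot (fully qualified form) is ignored.
    public static String lastLabel(String host) {
        String h = host.endsWith(".") ? host.substring(0, host.length() - 1) : host;
        int dot = h.lastIndexOf('.');
        if (dot < 0) {
            return "";
        }
        return h.substring(dot + 1).toLowerCase(Locale.ROOT);
    }
}
```

 The returned label could then be stored in its own index field, so that a 
 boost can be applied per TLD at query time.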




Re: RSS-fecter and index individul-how can i realize this function

2007-02-07 Thread Doğacan Güney
Renaud Richardet wrote:
 Doug Cutting wrote:
 Renaud Richardet wrote:
 I see. I was thinking that I could index the feed items without
 having to fetch them individually.

 Okay, so if Parser#parse returned a Map<String,Parse>, then the URL
 for each parse should be that of its link, since you don't want to
 fetch that separately. Right?
 Exactly.

 So now the question is, how much impact would this change to the
 Parser API have on the rest of Nutch? It would require changes to all
 Parser implementations, to ParseSegment, to ParseUtil, and to
 Fetcher. But, as far as I can tell, most of these changes look
 straightforward.
 I think so, too. I have opened an issue in JIRA
 (https://issues.apache.org/jira/browse/NUTCH-443) and will give it a try.
 Doğacan, have you started working on it yet?

I have just started working on it. I hope I will have something (at
least a patch for everything but plugins) within the day.

--
Doğacan Güney


 Thanks,
 Renaud