[jira] Created: (NUTCH-441) Thai Analyzer Plugin
Thai Analyzer Plugin
Key: NUTCH-441
URL: https://issues.apache.org/jira/browse/NUTCH-441
Project: Nutch
Issue Type: New Feature
Components: indexer
Reporter: Vee Satayamas

This Thai analyzer plugin was created by copying and modifying the French analyzer plugin. However, there is no Thai analyzer in lucene-analyzers-2.0.0.jar (in lib-lucene-analyzers), so lucene-analyzers-nightly.jar was used instead.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
[jira] Updated: (NUTCH-441) Thai Analyzer Plugin
[ https://issues.apache.org/jira/browse/NUTCH-441?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Vee Satayamas updated NUTCH-441:
Attachment: nutch-plugin-analysis-th-20070207.patch.gz

Thai Analyzer (the lib-lucene-analyzers modification is not included in the patch)

Thai Analyzer Plugin
Key: NUTCH-441
URL: https://issues.apache.org/jira/browse/NUTCH-441
Project: Nutch
Issue Type: New Feature
Components: indexer
Reporter: Vee Satayamas
Attachments: nutch-plugin-analysis-th-20070207.patch.gz

This Thai analyzer plugin was created by copying and modifying the French analyzer plugin. However, there is no Thai analyzer in lucene-analyzers-2.0.0.jar (in lib-lucene-analyzers), so lucene-analyzers-nightly.jar was used instead.
Re: RSS-fecter and index individul-how can i realize this function
Renaud Richardet wrote:
> I see. I was thinking that I could index the feed items without having to fetch them individually.

Okay, so if Parser#parse returned a Map<String,Parse>, then the URL for each parse should be that of its link, since you don't want to fetch that separately. Right?

So now the question is, how much impact would this change to the Parser API have on the rest of Nutch? It would require changes to all Parser implementations, to ParseSegment, to ParseUtil, and to Fetcher. But, as far as I can tell, most of these changes look straightforward.

Doug
How Nutch can be used to build a vertical search engine?
I am trying to build a vertical search engine using a rule-based crawling strategy. I have finished this part as a web application, and I want to combine it with Nutch to control and select the URLs to be fetched. Are there any ideas on how to do that?
[jira] Created: (NUTCH-442) Integrate Solr/Nutch
Integrate Solr/Nutch
Key: NUTCH-442
URL: https://issues.apache.org/jira/browse/NUTCH-442
Project: Nutch
Issue Type: New Feature
Environment: Ubuntu linux
Reporter: rubdabadub

Hi:

I tried out Sami's patch for Solr/Nutch integration, which can be found here (http://blog.foofactory.fi/2007/02/online-indexing-integrating-nutch-with.html), and I can confirm it worked :-) That led me to request the following:

I would be very grateful if this could be included in Nutch 0.9, as I am trying to eliminate my Python-based crawler, which posts documents to Solr. Since I am in a corporate environment, I can't install a trunk version in production, so I am asking for this to be included in the 0.9 release.

I hope my wish will be granted. I look forward to getting some feedback.

Thank you.
Re: RSS-fecter and index individul-how can i realize this function
Guys,

Sorry to be so thick-headed, but could someone explain to me in really simple language what this change is requesting that is different from the current Nutch API? I still don't get it, sorry...

Cheers,
Chris

On 2/7/07 9:58 AM, Doug Cutting [EMAIL PROTECTED] wrote:
> Renaud Richardet wrote:
>> I see. I was thinking that I could index the feed items without having to fetch them individually.
>
> Okay, so if Parser#parse returned a Map<String,Parse>, then the URL for each parse should be that of its link, since you don't want to fetch that separately. Right?
>
> So now the question is, how much impact would this change to the Parser API have on the rest of Nutch? It would require changes to all Parser implementations, to ParseSegment, to ParseUtil, and to Fetcher. But, as far as I can tell, most of these changes look straightforward.
>
> Doug

Chris A. Mattmann
[EMAIL PROTECTED]
Staff Member
Modeling and Data Management Systems Section (387)
Data Management Systems and Technologies Group
Jet Propulsion Laboratory, Pasadena, CA
Office: 171-266B, Mailstop: 171-246

Disclaimer: The opinions presented within are my own and do not reflect those of either NASA, JPL, or the California Institute of Technology.
[jira] Created: (NUTCH-443) allow parsers to return multiple Parse object, this will speed up the rss parser
allow parsers to return multiple Parse object, this will speed up the rss parser
Key: NUTCH-443
URL: https://issues.apache.org/jira/browse/NUTCH-443
Project: Nutch
Issue Type: New Feature
Components: fetcher
Affects Versions: 0.9.0
Reporter: Renaud Richardet
Priority: Minor
Fix For: 0.9.0

Allow Parser#parse to return a Map<String,Parse>. This way, the RSS parser can return multiple Parse objects, which will all be indexed separately. Advantage: no need to fetch all feed items separately.

See the discussion at http://www.nabble.com/RSS-fecter-and-index-individul-how-can-i-realize-this-function-tf3146271.html
Re: RSS-fecter and index individul-how can i realize this function
Doug Cutting wrote:
> Renaud Richardet wrote:
>> I see. I was thinking that I could index the feed items without having to fetch them individually.
>
> Okay, so if Parser#parse returned a Map<String,Parse>, then the URL for each parse should be that of its link, since you don't want to fetch that separately. Right?

Exactly.

> So now the question is, how much impact would this change to the Parser API have on the rest of Nutch? It would require changes to all Parser implementations, to ParseSegment, to ParseUtil, and to Fetcher. But, as far as I can tell, most of these changes look straightforward.

I think so, too. I have opened an issue in JIRA (https://issues.apache.org/jira/browse/NUTCH-443) and will give it a try. Doğacan, have you started working on it yet?

Thanks,
Renaud
Re: RSS-fecter and index individul-how can i realize this function
Chris Mattmann wrote:
> Sorry to be so thick-headed, but could someone explain to me in really simple language what this change is requesting that is different from the current Nutch API? I still don't get it, sorry...

A Content would no longer generate a single Parse. Instead, a Content could potentially generate many Parses. For most types of content, e.g., HTML, each Content would still generate a single Parse. But for RSS, a Content might generate multiple Parses, each indexed separately and each with a distinct URL.

Another potential application could be processing archives: the parser could unpack the archive, and each item in it could be indexed separately rather than indexing the archive as a whole. This only makes sense if each item has a distinct URL, which it does in RSS, but it might not in an archive. However, some archive file formats do contain URLs, like the one used by the Internet Archive:

http://www.archive.org/web/researcher/ArcFileFormat.php

Does that help?

Doug
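The change Doug describes can be sketched roughly as follows. This is only an illustration under stated assumptions: `Content`, `Parse`, `Parser`, `PageParser`, and `FeedParser` here are hypothetical, heavily simplified stand-ins, not the actual Nutch classes, and the tab-separated feed format is invented to keep the example short. The point is only the shape of the return type: a map from URL to Parse, with one entry for an ordinary page and one entry per item for a feed.

```java
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.Map;

class MultiParseSketch {

    // Hypothetical stand-in for Nutch's Content: a fetched URL plus its raw text.
    static final class Content {
        final String url;
        final String body;
        Content(String url, String body) { this.url = url; this.body = body; }
    }

    // Hypothetical stand-in for Nutch's Parse: only the extracted text.
    static final class Parse {
        final String text;
        Parse(String text) { this.text = text; }
    }

    // The proposed shape: one Content may yield several Parses, keyed by URL.
    interface Parser {
        Map<String, Parse> parse(Content content);
    }

    // Ordinary page: still exactly one Parse, keyed by the page's own URL.
    static final class PageParser implements Parser {
        public Map<String, Parse> parse(Content content) {
            return Collections.singletonMap(content.url, new Parse(content.body));
        }
    }

    // Feed: one Parse per item, keyed by the item's link, so items need no separate fetch.
    // Toy input format (an assumption): one "link<TAB>text" item per line.
    static final class FeedParser implements Parser {
        public Map<String, Parse> parse(Content content) {
            Map<String, Parse> parses = new LinkedHashMap<>();
            for (String line : content.body.split("\n")) {
                String[] fields = line.split("\t", 2);
                if (fields.length == 2) {
                    parses.put(fields[0], new Parse(fields[1]));
                }
            }
            return parses;
        }
    }

    public static void main(String[] args) {
        Content feed = new Content("http://example.com/feed.rss",
                "http://example.com/a\tFirst item\nhttp://example.com/b\tSecond item");
        for (Map.Entry<String, Parse> e : new FeedParser().parse(feed).entrySet()) {
            System.out.println(e.getKey() + " -> " + e.getValue().text);
        }
    }
}
```

Callers such as the fetcher or the segment parser would then iterate over the map entries instead of handling a single Parse, which is where most of the follow-on changes mentioned in the thread would come from.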
NPE while fetching
Hi,

I am experiencing an NPE while fetching. I use Nutch trunk (from a week ago) with Hadoop 0.11.1.

java.lang.NullPointerException
    at org.apache.hadoop.io.SequenceFile$Sorter$MergeQueue.merge(SequenceFile.java:2392)
    at org.apache.hadoop.io.SequenceFile$Sorter.merge(SequenceFile.java:2087)
    at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.mergeParts(MapTask.java:498)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:191)
    at org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:1372)

Any pointers to the cause?

Thanks,
Gal.
Re: NPE while fetching
This was corrected in Hadoop as per issue HADOOP-917, but I'm thinking some code in Nutch might have to be changed also. I reported this issue (via the mailing list) a while ago and I'm glad it was fixed, but I have been purposely staying with revision 495214 of trunk, which seems to provide the best stability/performance.

----- Original Message -----
From: Gal Nitzan [EMAIL PROTECTED]
To: nutch-dev@lucene.apache.org
Sent: Wednesday, February 7, 2007 6:36:19 PM
Subject: NPE while fetching

> Hi,
>
> I experience NPE while fetching. I use Nutch trunk (a week ago) with Hadoop 0.11.1.
>
> java.lang.NullPointerException
>     at org.apache.hadoop.io.SequenceFile$Sorter$MergeQueue.merge(SequenceFile.java:2392)
>     at org.apache.hadoop.io.SequenceFile$Sorter.merge(SequenceFile.java:2087)
>     at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.mergeParts(MapTask.java:498)
>     at org.apache.hadoop.mapred.MapTask.run(MapTask.java:191)
>     at org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:1372)
>
> Any pointers to the cause?
>
> Thanks,
> Gal.
Re: RSS-fecter and index individul-how can i realize this function
> Also true. On the other hand, Nutch provides 98% of an RSS search engine. It'd be a shame to have to re-invent everything else, and it would be great if Nutch could evolve to support RSS well.
>
> Could image search also benefit from this? One could generate a Parse for each image on a page whose text was from the page. Product search too, perhaps.

These are excellent points. I am totally +1 for the API change; it opens doors for a lot of new possible applications.

--
Sami Siren
[jira] Updated: (NUTCH-439) Top Level Domains Indexing / Scoring
[ https://issues.apache.org/jira/browse/NUTCH-439?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Enis Soztutar updated NUTCH-439:
Attachment: tld_plugin_v1.1.patch

I accidentally forgot to unset http.agent.name in v1.0. This version is the same except that the agent name is not set. This patch obsoletes v1.0.

Top Level Domains Indexing / Scoring
Key: NUTCH-439
URL: https://issues.apache.org/jira/browse/NUTCH-439
Project: Nutch
Issue Type: New Feature
Components: indexer
Affects Versions: 0.9.0
Reporter: Enis Soztutar
Attachments: tld_plugin_v1.0.patch, tld_plugin_v1.1.patch

Top Level Domains (TLDs) are the last part(s) of the host name in a DNS system. TLDs are managed by the Internet Assigned Numbers Authority. IANA divides TLDs into three groups: infrastructure, generic (such as com, edu), and country code TLDs (such as de, tr). Indexing the top level domain, and optionally boosting on it, is needed for improving the search results and enhancing locality.
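As a rough illustration of the idea in this issue (not the code in the attached patch, which is not shown here), the sketch below extracts the last label of a host name and applies a boost when it matches a preferred TLD. The class name `TldSketch`, both method names, and the boost factor are hypothetical. The extraction is deliberately naive: it assumes a plain DNS host name, treats only the final label as the TLD, and so does not handle IP addresses or multi-label suffixes.

```java
import java.util.Locale;

class TldSketch {

    // Returns the last dot-separated label of a host name, lower-cased,
    // or an empty string if the host contains no dot (e.g. "localhost").
    static String topLevelDomain(String host) {
        String h = host.toLowerCase(Locale.ROOT);
        if (h.endsWith(".")) {
            h = h.substring(0, h.length() - 1); // drop the trailing dot of a fully qualified name
        }
        int dot = h.lastIndexOf('.');
        return dot < 0 ? "" : h.substring(dot + 1);
    }

    // Hypothetical scoring rule: multiply by `factor` when the TLD matches the preferred one.
    static float boost(String tld, String preferredTld, float factor) {
        return tld.equals(preferredTld) ? factor : 1.0f;
    }

    public static void main(String[] args) {
        String tld = topLevelDomain("www.example.com.tr");
        System.out.println(tld + " boost=" + boost(tld, "tr", 2.0f));
    }
}
```

An indexing plugin along these lines would store the extracted value in its own field so that searches can filter or boost on it; the field name and boost values would be configuration choices.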
Re: RSS-fecter and index individul-how can i realize this function
Renaud Richardet wrote:
> Doug Cutting wrote:
>> Renaud Richardet wrote:
>>> I see. I was thinking that I could index the feed items without having to fetch them individually.
>>
>> Okay, so if Parser#parse returned a Map<String,Parse>, then the URL for each parse should be that of its link, since you don't want to fetch that separately. Right?
>
> Exactly.
>
>> So now the question is, how much impact would this change to the Parser API have on the rest of Nutch? It would require changes to all Parser implementations, to ParseSegment, to ParseUtil, and to Fetcher. But, as far as I can tell, most of these changes look straightforward.
>
> I think so, too. I have opened an issue in JIRA (https://issues.apache.org/jira/browse/NUTCH-443) and will give it a try. Doğacan, have you started working on it yet?

I have just started working on it. I hope I will have something (at least a patch for everything but plugins) within the day.

--
Doğacan Güney

> Thanks,
> Renaud