[jira] Updated: (NUTCH-706) Url regex normalizer

2010-03-31 Thread Julien Nioche (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-706?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Nioche updated NUTCH-706:


Fix Version/s: (was: 1.1)

Both variants of the substitution rule above break existing tests. More work 
will be needed to get a pattern which covers the case described by Meghna *and* 
is compatible with the existing test cases.
Moving it to post-1.1

 Url regex normalizer
 

 Key: NUTCH-706
 URL: https://issues.apache.org/jira/browse/NUTCH-706
 Project: Nutch
  Issue Type: Bug
Affects Versions: 1.0.0
Reporter: Meghna Kukreja
Priority: Minor

 Hey,
 I encountered the following problem while trying to crawl a site using
 nutch-trunk. In the file regex-normalize.xml, the following regex is
 used to remove session ids:
 <pattern>([;_]?((?i)l|j|bv_)?((?i)sid|phpsessid|sessionid)=.*?)(\?|&amp;|#|$)</pattern>.
 This pattern also transforms a url, such as,
 newsId=2000484784794&newsLang=en into new&newsLang=en (since it
 matches 'sId' in the 'newsId'), which is incorrect and hence does not
 get fetched. This expression needs to be changed to prevent this.
 Thanks,
 Meghna
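
 The false match can be reproduced in isolation. The sketch below assumes the rule's substitution keeps only the terminator (group 4) and that the XML entity in the rule decodes to a plain '&'. TIGHTENED is a hypothetical variant that is not taken from this issue, and it has not been run against the existing Nutch tests, which earlier candidate rules are reported to break.

```java
import java.util.regex.Pattern;

final class SessionIdRuleDemo {
    // The rule as quoted in the report, with the XML entity decoded to '&'.
    static final Pattern ORIGINAL = Pattern.compile(
        "([;_]?((?i)l|j|bv_)?((?i)sid|phpsessid|sessionid)=.*?)(\\?|&|#|$)");

    // Hypothetical tightening (an assumption, not from the issue): the parameter
    // name must either follow a consumed ';' or '_' delimiter, or not be preceded
    // by a letter or digit, so the "sId" inside "newsId" no longer matches.
    static final Pattern TIGHTENED = Pattern.compile(
        "((?:[;_]|(?<![A-Za-z0-9]))((?i)l|j|bv_)?((?i)sid|phpsessid|sessionid)=.*?)(\\?|&|#|$)");

    // Assumes the substitution keeps only the terminator, i.e. group 4.
    static String apply(Pattern rule, String url) {
        return rule.matcher(url).replaceAll("$4");
    }

    public static void main(String[] args) {
        String url = "newsId=2000484784794&newsLang=en";
        System.out.println(apply(ORIGINAL, url));   // the mangled form from the report
        System.out.println(apply(TIGHTENED, url));  // left untouched
    }
}
```

 A real session id such as ;jsessionid=ABC123 is still stripped by both patterns; whether the tightened form is compatible with the existing test cases is exactly the open question of this issue.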

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (NUTCH-706) Url regex normalizer

2010-03-31 Thread Ken Krugler (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-706?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12851923#action_12851923
 ] 

Ken Krugler commented on NUTCH-706:
---

Two comments about this:

1. From my experiences with Nutch & Bixo, I think that URL normalization 
ultimately needs to be more structured - i.e. first break the URL into pieces, 
then apply rules against the pieces. Trying to craft regular expressions to 
handle target cases leads to big, hairy, hard-to-understand strings.

2. URL normalization is something that makes a lot of sense for 
crawler-commons. If somebody from the Nutch side wants to define a target API, 
I could look at porting existing Bixo code to crawler-commons.
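
A minimal sketch of the "break into pieces, then apply rules" idea from point 1, using only java.net.URI. The class name, the list of session parameter names and the method are assumptions for illustration; this is not the Bixo code, not a crawler-commons API, and it only looks at query parameters (path parameters such as ;jsessionid= are ignored).

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

final class StructuredNormalizerSketch {
    // Hypothetical list of session parameter names, compared as whole names.
    private static final Set<String> SESSION_PARAMS = new HashSet<>(Arrays.asList(
        "sid", "lsid", "jsid", "bv_sid", "phpsessid", "sessionid", "jsessionid"));

    static String normalize(String url) {
        URI uri = URI.create(url);               // step 1: break the URL into pieces
        String query = uri.getRawQuery();
        if (query == null) {
            return url;                          // no query, nothing to filter
        }
        List<String> kept = new ArrayList<>();
        for (String pair : query.split("&")) {   // step 2: apply the rule per parameter
            int eq = pair.indexOf('=');
            String name = (eq >= 0 ? pair.substring(0, eq) : pair).toLowerCase(Locale.ROOT);
            if (!pair.isEmpty() && !SESSION_PARAMS.contains(name)) {
                kept.add(pair);
            }
        }
        StringBuilder out = new StringBuilder(); // step 3: reassemble from the raw pieces
        if (uri.getScheme() != null) out.append(uri.getScheme()).append(':');
        if (uri.getRawAuthority() != null) out.append("//").append(uri.getRawAuthority());
        if (uri.getRawPath() != null) out.append(uri.getRawPath());
        if (!kept.isEmpty()) out.append('?').append(String.join("&", kept));
        if (uri.getRawFragment() != null) out.append('#').append(uri.getRawFragment());
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(normalize("http://example.com/news?newsId=2000484784794&newsLang=en"));
        System.out.println(normalize("http://example.com/a?sid=42&b=1"));
    }
}
```

Because parameter names are compared whole, a name like newsId cannot be confused with sid, which is the kind of accident a single large regular expression invites.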





Re: 1.1 release?

2010-03-31 Thread Mattmann, Chris A (388J)
Hey Guys,

OK I'm finally getting around to this: I am going to push all the current 1.1 
JIRA issues out and set their fix version to nil. Once I'm done with this, 
I'll wait 48 hrs to see if there is anything that anyone really wants to get 
into 1.1. So, please, take a look here [1] and make sure that if you wanted 
your issue into 1.1, that it's there.

After 48 hours, I'll make one more announcement, and wait 24 hours before 
cutting the 1.1 RC and pushing to people.a.o for review. Here I go!

Cheers,
Chris



[1] http://bit.ly/cNehBc


On 3/9/10 10:54 AM, Andrzej Bialecki a...@getopt.org wrote:

On 2010-03-09 18:17, Julien Nioche wrote:
 Hi Chris,

 Excellent idea! There have been quite a few changes since 1.0 and it's
 probably the right time to have a new release.

+1. Let's just check JIRA and make sure we didn't forget anything
important ...


 Not really a blocker but https://issues.apache.org/jira/browse/NUTCH-762
 would be nice to have in 1.1, just needs a bit of reviewing / testing I
 suppose. Otherwise this can wait until after 1.1

I'll try to test it before the weekend.

--
Best regards,
Andrzej Bialecki 
  ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com




++
Chris Mattmann, Ph.D.
Senior Computer Scientist
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 171-266B, Mailstop: 171-246
Email: chris.mattm...@jpl.nasa.gov
WWW:   http://sunset.usc.edu/~mattmann/
++
Adjunct Assistant Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++



[jira] Updated: (NUTCH-249) black- white list url filtering

2010-03-31 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-249?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated NUTCH-249:


Fix Version/s: (was: 1.1)

- push out per http://bit.ly/c7tBv9

 black- white list url filtering
 ---

 Key: NUTCH-249
 URL: https://issues.apache.org/jira/browse/NUTCH-249
 Project: Nutch
  Issue Type: Improvement
  Components: fetcher
Affects Versions: 0.8
Reporter: Stefan Groschupf
Assignee: Dennis Kubes
Priority: Trivial
 Attachments: blackWhiteListV2.patch, blackWhiteListV3.patch, bw.patch


 Existing url filter mechanisms need to process each url against each filter 
 pattern. For very large filter sets this may not scale very well.
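
 To illustrate the scaling concern: a list keyed by host can be checked with one hash lookup per URL instead of one regex evaluation per pattern. The sketch below is an assumption about what such a filter could look like, not the content of the attached patches; the class and method names are hypothetical.

```java
import java.net.URI;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

final class HostListFilterSketch {
    private final Set<String> whiteHosts;
    private final Set<String> blackHosts;

    HostListFilterSketch(Set<String> whiteHosts, Set<String> blackHosts) {
        this.whiteHosts = whiteHosts;
        this.blackHosts = blackHosts;
    }

    // One hash lookup per list, independent of how many hosts are listed.
    boolean accept(String url) {
        String host = URI.create(url).getHost();
        if (host == null) {
            return false;                        // no usable host, reject
        }
        host = host.toLowerCase(Locale.ROOT);
        if (blackHosts.contains(host)) {
            return false;                        // the black list always wins
        }
        // An empty white list is treated as "allow everything not black-listed".
        return whiteHosts.isEmpty() || whiteHosts.contains(host);
    }

    public static void main(String[] args) {
        HostListFilterSketch filter = new HostListFilterSketch(
            new HashSet<>(Arrays.asList("good.example")),
            new HashSet<>(Arrays.asList("bad.example")));
        System.out.println(filter.accept("http://good.example/a"));
        System.out.println(filter.accept("http://bad.example/"));
    }
}
```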




[jira] Updated: (NUTCH-309) Uses commons logging Code Guards

2010-03-31 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-309?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated NUTCH-309:


Fix Version/s: (was: 1.1)

- pushing this out per http://bit.ly/c7tBv9

 Uses commons logging Code Guards
 

 Key: NUTCH-309
 URL: https://issues.apache.org/jira/browse/NUTCH-309
 Project: Nutch
  Issue Type: Improvement
Affects Versions: 0.8
Reporter: Jerome Charron
Assignee: Chris A. Mattmann
Priority: Minor

 Code guards are typically used to guard code that only needs to execute in 
 support of logging, that otherwise introduces undesirable runtime overhead in 
 the general case (logging disabled). Examples are multiple parameters, or 
 expressions (e.g. string + " more") for parameters. Use the guard methods of 
 the form log.is<Priority>() to verify that logging should be performed, 
 before incurring the overhead of the logging method call. Yes, the logging 
 methods will perform the same check, but only after resolving parameters.
 (description extracted from 
 http://jakarta.apache.org/commons/logging/guide.html#Code_Guards)
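
 A small sketch of the guard idiom. To stay self-contained it uses java.util.logging, where the guard is spelled isLoggable(Level.FINE); with commons-logging the equivalent call would be a method such as log.isDebugEnabled(). The class, the counter and the describe() helper are hypothetical.

```java
import java.util.Arrays;
import java.util.logging.Level;
import java.util.logging.Logger;

final class CodeGuardSketch {
    static final Logger LOG = Logger.getLogger(CodeGuardSketch.class.getName());

    // Counts how often the costly message argument is actually built.
    static int expensiveCalls = 0;

    static String describe(int[] values) {
        expensiveCalls++;
        return Arrays.toString(values);
    }

    static void process(int[] values) {
        // Without the guard, "values=" + describe(values) would be evaluated
        // before the logger gets a chance to discard the message.
        if (LOG.isLoggable(Level.FINE)) {
            LOG.fine("values=" + describe(values));
        }
    }

    public static void main(String[] args) {
        process(new int[] {1, 2, 3});
        System.out.println("expensive calls: " + expensiveCalls);
    }
}
```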




[jira] Updated: (NUTCH-763) Separate configuration files from resources to be included in the job file

2010-03-31 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-763?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated NUTCH-763:


Fix Version/s: (was: 1.1)

- pushing this out per http://bit.ly/c7tBv9

 Separate configuration files from resources to be included in the job file
 --

 Key: NUTCH-763
 URL: https://issues.apache.org/jira/browse/NUTCH-763
 Project: Nutch
  Issue Type: Wish
Reporter: Julien Nioche
Priority: Minor

 One of the things I found confusing when I was learning Nutch was the fact 
 that the conf/ directory contains at the same time : 
 - configuration files for Hadoop / Nutch which are put in the jar files but 
 not used there
 - resource files (e.g. filtering rules) which MUST be up to date in the job 
 file
 I would separate the conf/ directory from say a resources/ directory which 
 would contain the rule files and other things to put in the job file. Unless 
 I am mistaken none of the configuration files need to be in the job file. I 
 know it is a very minor point, but that would probably simplify things and 
 make it easier for beginners to understand what has to be modified where. 




[jira] Updated: (NUTCH-577) Use explicit tika-config.xml file to enable mime magic detection to be turned on and off

2010-03-31 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-577?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated NUTCH-577:


 Due Date: 30/Nov/07  (was: 30/Nov/07)
Fix Version/s: (was: 1.1)

- pushing this out per http://bit.ly/c7tBv9

 Use explicit tika-config.xml file to enable mime magic detection to be turned 
 on and off
 

 Key: NUTCH-577
 URL: https://issues.apache.org/jira/browse/NUTCH-577
 Project: Nutch
  Issue Type: Improvement
  Components: mime_type_detector
Affects Versions: 1.0.0
 Environment: Mac Book Pro Intel Core Duo 2.0 GHz, 2.0 GB RAM, Mac OS 
 X 10.4, although improvement is indep. of env.
Reporter: Chris A. Mattmann
Assignee: Chris A. Mattmann
Priority: Minor

 Currently, there is a configuration file for Tika (which the trunk in Nutch 
 uses for its mime type detection) called tika-config.xml left unexposed (a 
 default one lives in the tika-0.1-dev.jar file). Tika's mime system has two 
 config files it relies on: tika-mimetypes.xml (which Nutch has its own 
 version of, that overrides the version that comes with the tika jar file), 
 and tika-config.xml (to turn on or off magic char detection). We should 
 probably have a nutch version of tika-config.xml, so that Nutch users can 
 employ magic char mime detection. I'll get going on this in the next day or 
 so.




[jira] Updated: (NUTCH-310) Review Log Levels

2010-03-31 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-310?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated NUTCH-310:


Fix Version/s: (was: 1.1)
 Assignee: Chris A. Mattmann  (was: Jerome Charron)

- pushing this out per http://bit.ly/c7tBv9 (and assign to me, I think this can 
be closed but will wait until after 1.1 to revisit)

 Review Log Levels
 -

 Key: NUTCH-310
 URL: https://issues.apache.org/jira/browse/NUTCH-310
 Project: Nutch
  Issue Type: Improvement
Affects Versions: 0.8
Reporter: Jerome Charron
Assignee: Chris A. Mattmann
Priority: Minor

 Review of logs content and logs levels (see Commons Logging Best Practices : 
 http://jakarta.apache.org/commons/logging/guide.html#Message_Priorities_Levels)




[jira] Updated: (NUTCH-673) Upgrade the Carrot2 plug-in to release 3.0

2010-03-31 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-673?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated NUTCH-673:


Fix Version/s: (was: 1.1)

- pushing this out per http://bit.ly/c7tBv9

 Upgrade the Carrot2 plug-in to release 3.0
 --

 Key: NUTCH-673
 URL: https://issues.apache.org/jira/browse/NUTCH-673
 Project: Nutch
  Issue Type: Improvement
  Components: web gui
Affects Versions: 0.9.0
 Environment: All Nutch deployments.
Reporter: Sean Dean
Priority: Minor

 Release 3.0 of the Carrot2 plug-in was released recently.
 We currently have version 2.1 in the source tree and upgrading it to the 
 latest version before 1.0-release might make sense.
 Details on the release can be found here: 
 http://project.carrot2.org/release-3.0-notes.html
 One major change in requirements is for JDK 1.5 to be used, but this is also 
 now required for Hadoop 0.19 so this wouldn't be the only reason for the 
 switch.




[jira] Updated: (NUTCH-664) Possibility to update already stored documents.

2010-03-31 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-664?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated NUTCH-664:


Fix Version/s: (was: 1.1)

- pushing this out per http://bit.ly/c7tBv9

 Possibility to update already stored documents.
 ---

 Key: NUTCH-664
 URL: https://issues.apache.org/jira/browse/NUTCH-664
 Project: Nutch
  Issue Type: Wish
Reporter: Sergey Khilkov
Priority: Minor

 We have a huge index of stored documents. It is a costly procedure to fetch 
 a page and merge indexes every time we update some information about the page. 
 The information can change 1-3 times per day. At the moment we have to store 
 the changed info in a database, but in this case we have lots of problems with 
 sorting, search restrictions and so on. Lucene itself allows deleting a single 
 document and adding a new one to an existing index. But there is a problem with 
 Hadoop... As I understand it, the Hadoop filesystem cannot write at 
 random positions. But it would be a great feature if Nutch were able to 
 update an already created index.




[jira] Updated: (NUTCH-750) HtmlParser plugin - page title extraction

2010-03-31 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-750?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated NUTCH-750:


Fix Version/s: (was: 1.1)

- pushing this out per http://bit.ly/c7tBv9

 HtmlParser plugin - page title extraction
 -

 Key: NUTCH-750
 URL: https://issues.apache.org/jira/browse/NUTCH-750
 Project: Nutch
  Issue Type: Improvement
  Components: parser
Affects Versions: 1.0.0
Reporter: Alexey Torochkov
Priority: Minor
 Attachments: SkipBody.patch


 A small improvement that tries to extract the title tag from the body if it 
 doesn't exist in the head.
 In the current version DOMContentUtils just skips everything after body in the 
 getTitle() method.
 The attached patch allows this behavior to be changed (by default it doesn't 
 change anything) and can cope with webmasters' mistakes.




[jira] Updated: (NUTCH-564) External parser supports encoding attribute

2010-03-31 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-564?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated NUTCH-564:


   Patch Info: [Patch Available]
Fix Version/s: (was: 1.1)

- pushing this out per http://bit.ly/c7tBv9

 External parser supports encoding attribute
 ---

 Key: NUTCH-564
 URL: https://issues.apache.org/jira/browse/NUTCH-564
 Project: Nutch
  Issue Type: Improvement
  Components: indexer
Affects Versions: 0.9.0
 Environment: All
Reporter: Antony Bowesman
Priority: Minor
 Attachments: ExtParser_0.9.0.patch, ExtParser_1.0.0.patch


 When an external component generates text, which is returned to the external 
 parser, it always converts the text using the default character set.  
 (os.toString()).  For example, the returned text may be utf-8, but will not 
 be converted to a String correctly.
 I added the attribute encoding to the implementation XML in plugin.xml 
 and this is then used to convert the text.
 I have tested my original fix on my local 0.9 and include a patch, but have 
 also made an untested patch for trunk.
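
 The difference can be sketched as follows. The decode() helper and its encoding parameter are hypothetical stand-ins for the attribute described above, not code from the attached patches.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.Charset;

final class ExternalOutputDecoding {
    // 'encoding' stands in for the attribute read from plugin.xml (assumed name).
    static String decode(ByteArrayOutputStream os, String encoding) {
        if (encoding == null || encoding.isEmpty()) {
            // Behaviour described in the report: platform default charset.
            return os.toString();
        }
        // Decode the captured bytes with the declared charset instead.
        return new String(os.toByteArray(), Charset.forName(encoding));
    }

    public static void main(String[] args) {
        ByteArrayOutputStream os = new ByteArrayOutputStream();
        byte[] utf8 = "jotain suomeksi \u00e4\u00f6".getBytes(Charset.forName("UTF-8"));
        os.write(utf8, 0, utf8.length);
        System.out.println(decode(os, "UTF-8"));
    }
}
```

 With no encoding given, the result depends on the JVM's default charset, which is why UTF-8 output from an external tool can be converted incorrectly on some machines and correctly on others.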




[jira] Updated: (NUTCH-477) Extend URLFilters to support different filtering chains

2010-03-31 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-477?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated NUTCH-477:


Fix Version/s: (was: 1.1)

- pushing this out per http://bit.ly/c7tBv9

 Extend URLFilters to support different filtering chains
 ---

 Key: NUTCH-477
 URL: https://issues.apache.org/jira/browse/NUTCH-477
 Project: Nutch
  Issue Type: Improvement
Affects Versions: 1.1
Reporter: Andrzej Bialecki 
Assignee: Andrzej Bialecki 
Priority: Minor
 Attachments: urlfilters.patch


 I propose to make the following changes to URLFilters:
 * extend URLFilters so that they support different filtering rules depending 
 on the context where they are executed. This functionality mirrors the one 
 that URLNormalizers already support.
 * change their return value to an int code, in order to support early 
 termination of long filtering chains.
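
 A rough sketch of the second point, early termination through an int return code. The constants, the CodedUrlFilter interface and the example chain are hypothetical; the issue only says "an int code", and the attached urlfilters.patch may define this differently.

```java
import java.util.ArrayList;
import java.util.List;

final class FilterChainSketch {
    // Hypothetical return codes.
    static final int REJECT = 0;
    static final int CONTINUE = 1;
    static final int ACCEPT_AND_STOP = 2;

    interface CodedUrlFilter {
        int filter(String url);
    }

    // Runs the chain in order and stops as soon as a filter gives a final answer.
    static boolean accepts(List<CodedUrlFilter> chain, String url) {
        for (CodedUrlFilter f : chain) {
            int code = f.filter(url);
            if (code == REJECT) {
                return false;
            }
            if (code == ACCEPT_AND_STOP) {
                return true;   // later filters are never consulted
            }
        }
        return true;           // nobody objected
    }

    static List<CodedUrlFilter> exampleChain() {
        List<CodedUrlFilter> chain = new ArrayList<>();
        chain.add(url -> url.startsWith("http://trusted.example/") ? ACCEPT_AND_STOP : CONTINUE);
        chain.add(url -> url.endsWith(".gif") ? REJECT : CONTINUE);
        return chain;
    }

    public static void main(String[] args) {
        System.out.println(accepts(exampleChain(), "http://trusted.example/logo.gif"));
        System.out.println(accepts(exampleChain(), "http://other.example/logo.gif"));
    }
}
```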




[jira] Updated: (NUTCH-251) Administration GUI

2010-03-31 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-251?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated NUTCH-251:


   Patch Info: [Patch Available]
Fix Version/s: (was: 1.1)

- pushing this out per http://bit.ly/c7tBv9 (comment from me: would be nice to 
get this into 1.2)

 Administration GUI
 --

 Key: NUTCH-251
 URL: https://issues.apache.org/jira/browse/NUTCH-251
 Project: Nutch
  Issue Type: Improvement
Affects Versions: 0.8
Reporter: Stefan Groschupf
Priority: Minor
 Attachments: hadoop_nutch_gui_v1.patch, Nutch-251-AdminGUI.tar.gz, 
 nutch_gui_plugins_v1.zip, nutch_gui_v1.patch


 Having a web based administration interface would help to make nutch 
 administration and management much more user friendly.




[jira] Updated: (NUTCH-609) Allow Plugins to be Loaded from Jar File(s)

2010-03-31 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-609?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated NUTCH-609:


 Due Date: 13/Feb/08  (was: 13/Feb/08)
   Patch Info: [Patch Available]
Fix Version/s: (was: 1.1)

- pushing this out per http://bit.ly/c7tBv9

 Allow Plugins to be Loaded from Jar File(s)
 ---

 Key: NUTCH-609
 URL: https://issues.apache.org/jira/browse/NUTCH-609
 Project: Nutch
  Issue Type: Improvement
Affects Versions: 1.0.0
 Environment: All
Reporter: Dennis Kubes
Assignee: Dennis Kubes
Priority: Minor
 Attachments: NUTCH-609-1-20080212.patch


 Currently plugins cannot be loaded from a jar file.  Plugins must be unzipped 
 in one or more directories specified by the plugin.folders config.  I have 
 been thinking about an extension to PluginRepository or PluginManifestParser 
 (or both) that would allow plugins to be packaged into multiple independent jar 
 files and placed on the classpath.  The system would search the classpath for 
 resources with the correct folder name and would load any plugins in those 
 jars.
 This functionality would be very useful in making the nutch core more 
 flexible in terms of packaging.  It would also help with web applications 
 where we don't want to have a plugins directory included in the webapp.
 Thoughts so far are unzipping those plugin jars into a common temp directory 
 before loading.  Another option is using something like commons vfs to 
 interact with the jar files.  VFS essentially uses a disk based temporary cache 
 for jar files, so it is pretty much the same solution.   What are everyone 
 else's thoughts on this?




[jira] Resolved: (NUTCH-794) Language Identification must use check the parse metadata for language values

2010-03-31 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-794?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann resolved NUTCH-794.
-

Resolution: Fixed

@julien -- I think this issue has been fixed in Tika right? If not, feel free 
to reopen, or better yet, re-file the issue against a post 1.1 Nutch release. 
Thanks!

 Language Identification must use check the parse metadata for language values 
 --

 Key: NUTCH-794
 URL: https://issues.apache.org/jira/browse/NUTCH-794
 Project: Nutch
  Issue Type: Bug
  Components: parser
Reporter: Julien Nioche
Assignee: Julien Nioche
 Fix For: 1.1

 Attachments: NUTCH-794.patch


 The following HTML document : 
 <html lang="fi"><head>document 1 title</head><body>jotain 
 suomeksi</body></html>
 is rendered as the following xhtml by Tika : 
 <?xml version="1.0" encoding="UTF-8"?><html 
 xmlns="http://www.w3.org/1999/xhtml"><head><title/></head><body>document 1 
 titlejotain suomeksi</body></html>
 with the lang attribute getting lost.  The lang is not stored in the metadata 
 either.
 I will open an issue on Tika and modify TestHTMLLanguageParser so that the 
 tests don't break anymore 




[jira] Updated: (NUTCH-578) URL fetched with 403 is generated over and over again

2010-03-31 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-578?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated NUTCH-578:


Fix Version/s: (was: 1.1)

- pushing this out per http://bit.ly/c7tBv9

 URL fetched with 403 is generated over and over again
 -

 Key: NUTCH-578
 URL: https://issues.apache.org/jira/browse/NUTCH-578
 Project: Nutch
  Issue Type: Bug
  Components: generator
Affects Versions: 1.0.0
 Environment: Ubuntu Gutsy Gibbon (7.10) running on VMware server. I 
 have checked out the most recent version of the trunk as of Nov 20, 2007
Reporter: Nathaniel Powell
Assignee: Dennis Kubes
 Attachments: crawl-urlfilter.txt, NUTCH-578.patch, 
 NUTCH-578_v2.patch, NUTCH-578_v3.patch, NUTCH-578_v4.patch, nutch-site.xml, 
 regex-normalize.xml, urls.txt


 I have not changed the following parameter in the nutch-default.xml:
 <property>
   <name>db.fetch.retry.max</name>
   <value>3</value>
   <description>The maximum number of times a url that has encountered
   recoverable errors is generated for fetch.</description>
 </property>
 However, there is a URL which is on the site that I'm crawling, 
 www.teachertube.com, which keeps being generated over and over again for 
 almost every segment (many more times than 3):
 fetch of http://www.teachertube.com/images/ failed with: Http code=403, 
 url=http://www.teachertube.com/images/
 This is a bug, right?
 Thanks.




[jira] Updated: (NUTCH-540) some problem about the Nutch cache

2010-03-31 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-540?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated NUTCH-540:


Fix Version/s: (was: 1.1)

- pushing this out per http://bit.ly/c7tBv9

 some problem about the Nutch cache
 --

 Key: NUTCH-540
 URL: https://issues.apache.org/jira/browse/NUTCH-540
 Project: Nutch
  Issue Type: Bug
  Components: searcher
Affects Versions: 0.9.0
 Environment: Red hat AS4 + Tomcat5.5 + Nutch0.9
Reporter: crossany
 Attachments: 1.gif, 1186733525.jpg


 I am Chinese.
 I am testing searching for Chinese words in Nutch. I installed Nutch 0.9 in 
 Tomcat 5 on Linux; the Tomcat charset is UTF-8. I used Nutch to crawl a Chinese 
 website whose charset is also UTF-8. When I use Nutch on Tomcat to search for a 
 Chinese word, the title and description of the search results display 
 correctly, but when I click the cache link, the cached page is displayed with 
 the wrong charset, even though the cached page's charset is also UTF-8. 
 I found another website that uses Nutch, http://www.synoo.com:8080/zh/, and 
 searching for a Chinese word there shows the same error.
 When I use Luke to look at the segments, it can display the Chinese words, so I 
 think this may be a bug.




[jira] Updated: (NUTCH-455) dedup on tokenized fields is faulty

2010-03-31 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-455?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated NUTCH-455:


Fix Version/s: (was: 1.1)

- pushing this out per http://bit.ly/c7tBv9

 dedup on tokenized fields is faulty
 ---

 Key: NUTCH-455
 URL: https://issues.apache.org/jira/browse/NUTCH-455
 Project: Nutch
  Issue Type: Bug
  Components: searcher
Affects Versions: 0.9.0
Reporter: Enis Soztutar
 Attachments: IndexSearcherCacheWarm.patch


 (From LUCENE-252) 
 nutch uses several index servers, and the search results from these servers 
 are merged using a dedup field for deleting duplicates. The values from 
 this field are cached by Lucene's FieldCacheImpl. The default is the site 
 field, which is indexed and tokenized. However for a Tokenized Field (for 
 example url in nutch), FieldCacheImpl returns an array of Terms rather than 
 an array of field values, so dedup'ing becomes faulty. The current FieldCache 
 implementation does not respect tokenized fields, and as described above 
 caches only terms. 
 So in the situation that we are searching using url as the dedup field, 
 when a Hit is constructed in IndexSearcher, the dedupValue becomes a token of 
 the url (such as www or com) rather than the whole url. This prevents 
 using tokenized fields in the dedup field. 
 I have written a patch for lucene and attached it in 
 http://issues.apache.org/jira/browse/LUCENE-252, this patch fixes the 
 aforementioned issue about tokenized field caching. However building such a 
 cache for about 1.5M documents takes 20+ secs. The code in 
 IndexSearcher.translateHits() starts with
 if (dedupField != null) 
   dedupValues = FieldCache.DEFAULT.getStrings(reader, dedupField);
 and for the first call of search in IndexSearcher, cache is built. 
 Long story short, I have written a patch against IndexSearcher, which in its 
 constructor warms up the caches of the wanted fields (configurable). I think we 
 should vote for LUCENE-252, and then commit the above patch with the latest 
 version of lucene.




[jira] Updated: (NUTCH-747) injectIndex metadatas and inherit these metadatas to all matching suburls

2010-03-31 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-747?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated NUTCH-747:


Fix Version/s: (was: 1.1)

- pushing this out per http://bit.ly/c7tBv9

 injectIndex metadatas and inherit these metadatas to all matching suburls
 --

 Key: NUTCH-747
 URL: https://issues.apache.org/jira/browse/NUTCH-747
 Project: Nutch
  Issue Type: Improvement
  Components: indexer, injector
Reporter: Marko Bauhardt
 Attachments: index-metadata.patch, metadata.patch


 Hi.
 the following two patches support
 + injecting metadata for urls into a metadatadb
 url.com TAB METAKEY : TAB METAVALUE TAB METAVALUE METAKEY : 
 METAVALUE ...
 ...
 + updating the parse_data metadata of a shard and writing the metadata to all 
 fetched urls that start with a url from the metadatadb
 + in other words, this patch supports metadata inheritance for all matching suburls
 the second patch implements an index-metadata plugin.
 + this plugin extracts all metadata from the parse_data of a shard and indexes 
 it. Which metadata is indexed can be configured in the plugin.properties.
 + to index for example the lang you have to configure the plugin.properties: 
 lang=STORE,UNTOKENIZED
 + that means that the index plugin extracts metadata values with the key lang. If 
 they exist, all values are indexed, stored and untokenized
 Example
 create start url's in /tmp/urls/start/urls.txt
 http://lucene.apache.org/nutch/apidocs-1.0/index.html
 http://lucene.apache.org/nutch/apidocs-0.9/index.html
 create metadata url's in /tmp/urls/metadata/urls.txt
 http://lucene.apache.org/nutch/apidocs-1.0/ version:1.0
 http://lucene.apache.org/nutch/apidocs-0.9/ version:0.9
 Inject Urls
 bin/nutch inject crawldb /tmp/urls/start/
 bin/nutch org.apache.nutch.crawl.metadata.MetadataInjector metadatadb 
 /tmp/urls/metadata/
 Fetch  Parse  Update
 bin/nutch generate crawldb segments
 bin/nutch fetch segments/20090806105717/
 bin/nutch org.apache.nutch.crawl.metadata.ParseDataUpdater metadatadb 
 segments/20090806105717
 bin/nutch updatedb crawldb/ segments/20090806105717/
 Fetch  Parse  Update Again
 ...
 Index
 bin/nutch invertlinks linkdb -dir segments/
 bin/nutch index index crawldb/ linkdb/ segments/20090806105717 
 segments/20090806110127
 Check your Index
 All urls starting with http://lucene.apache.org/nutch/apidocs-1.0/  are 
 indexed with version:1.0.
 All urls starting with http://lucene.apache.org/nutch/apidocs-0.9/  are 
 indexed with version:0.9.
 This issue is somewhat related to http://issues.apache.org/jira/browse/NUTCH-655
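
 The inheritance rule described above (a fetched url receives the metadata of every metadatadb entry whose url is a prefix of it) can be sketched like this. The class and method names are hypothetical and the in-memory map only stands in for the metadatadb; the attached patches are the authority on how it is really done.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class PrefixMetadataSketch {
    // Collects the metadata of all entries whose key is a prefix of the url.
    static Map<String, String> inherit(Map<String, Map<String, String>> metadataDb, String url) {
        Map<String, String> result = new LinkedHashMap<>();
        for (Map.Entry<String, Map<String, String>> entry : metadataDb.entrySet()) {
            if (url.startsWith(entry.getKey())) {
                result.putAll(entry.getValue());
            }
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, Map<String, String>> db = new LinkedHashMap<>();
        Map<String, String> v10 = new LinkedHashMap<>();
        v10.put("version", "1.0");
        db.put("http://lucene.apache.org/nutch/apidocs-1.0/", v10);
        System.out.println(inherit(db, "http://lucene.apache.org/nutch/apidocs-1.0/index.html"));
    }
}
```

 The linear scan is only for illustration; with a large metadatadb a sorted structure or a join in a MapReduce job would be the more likely choice.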




[jira] Updated: (NUTCH-479) Support for OR queries

2010-03-31 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-479?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated NUTCH-479:


   Patch Info: [Patch Available]
Fix Version/s: (was: 1.1)

- pushing this out per http://bit.ly/c7tBv9

 Support for OR queries
 --

 Key: NUTCH-479
 URL: https://issues.apache.org/jira/browse/NUTCH-479
 Project: Nutch
  Issue Type: Improvement
  Components: searcher
Affects Versions: 1.0.0
Reporter: Andrzej Bialecki 
Assignee: Andrzej Bialecki 
 Attachments: nutch_0.9_OR.patch, or.patch, or.patch


 There have been many requests from users to extend Nutch query syntax to add 
 support for OR queries, in addition to the implicit AND and NOT queries 
 supported now.




[jira] Updated: (NUTCH-677) Segment merge filering based on segment content

2010-03-31 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-677?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated NUTCH-677:


Fix Version/s: (was: 1.1)

- pushing this out per http://bit.ly/c7tBv9

 Segment merge filering based on segment content
 ---

 Key: NUTCH-677
 URL: https://issues.apache.org/jira/browse/NUTCH-677
 Project: Nutch
  Issue Type: Improvement
Affects Versions: 0.9.0
Reporter: Marcin Okraszewski
 Attachments: MergeFilter.patch, MergeFilter_for_1.0.patch, 
 SegmentMergeFilter.java, SegmentMergeFilter.java, SegmentMergeFilters.java, 
 SegmentMergeFilters.java


 I needed a segment filtering based on meta data detected during parse phase. 
 Unfortunately current URL based filtering does not allow for this. So I have 
 created a new SegmentMergeFilter extension which receives segment entry which 
 is being merged and decides if it should be included or not. Even though I 
 needed only ParseData for my purpose I have done it a bit more general 
 purpose, so the filter receives all merged data.
 The attached patch is for version 0.9 which I use. Unfortunately I didn't 
 have time to check how it fits to trunk version. Sorry :(




[jira] Updated: (NUTCH-774) Retry interval in crawl date is set to 0

2010-03-31 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-774?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated NUTCH-774:


Fix Version/s: (was: 1.1)

- pushing this out per http://bit.ly/c7tBv9

 Retry interval in crawl date is set to 0
 

 Key: NUTCH-774
 URL: https://issues.apache.org/jira/browse/NUTCH-774
 Project: Nutch
  Issue Type: Bug
  Components: fetcher
Affects Versions: 1.0.0
Reporter: Reinhard Schwab
Assignee: Andrzej Bialecki 
 Attachments: NUTCH-774.patch, NUTCH-774_2.patch


 When I fetch and parse a feed with the feed plugin,
 http://www.wachauclimbing.net/home/impressum-disclaimer/feed/
 another crawl date is generated for
 http://www.wachauclimbing.net/home/impressum-disclaimer/comment-page-1/
 After fetching a second round,
 the dump of the crawl db still shows a retry interval of 0:
 http://www.wachauclimbing.net/home/impressum-disclaimer/comment-page-1/ 
 Version: 7
 Status: 2 (db_fetched)
 Fetch time: Wed Dec 02 12:48:22 CET 2009
 Modified time: Thu Jan 01 01:00:00 CET 1970
 Retries since fetch: 0
 Retry interval: 0 seconds (0 days)
 Score: 1.084
 Signature: db9ab2193924cd2d0b53113a500ca604
 Metadata: _pst_: success(1), lastModified=0
 A check should be done in the setFetchSchedule method of 
 DefaultFetchSchedule (or AbstractFetchSchedule).
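
The check suggested above could look roughly like the following sketch. It is not the attached patch and not Nutch's actual DefaultFetchSchedule code: the class RetryIntervalGuard, its method, and the 30-day default are hypothetical and only illustrate the idea of never keeping a retry interval of 0.

```java
/** Hypothetical helper, not part of Nutch: replaces an unusable retry interval with a default. */
final class RetryIntervalGuard {
    /** Assumed default of 30 days, in seconds (illustrative value only). */
    static final int DEFAULT_INTERVAL_SECONDS = 30 * 24 * 60 * 60;

    private RetryIntervalGuard() {}

    /** Returns the stored interval if it is positive, otherwise the given default. */
    static int effectiveInterval(int storedIntervalSeconds, int defaultIntervalSeconds) {
        if (defaultIntervalSeconds <= 0) {
            throw new IllegalArgumentException("default interval must be positive");
        }
        return storedIntervalSeconds > 0 ? storedIntervalSeconds : defaultIntervalSeconds;
    }
}
```

A schedule implementation could call such a guard before writing the interval back to the crawl db entry, so that an entry created with interval 0 gets a sensible value after its first fetch.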




[jira] Updated: (NUTCH-460) RDF parser plugin

2010-03-31 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-460?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated NUTCH-460:


Fix Version/s: (was: 1.1)

- pushing this out per http://bit.ly/c7tBv9

 RDF parser plugin
 -

 Key: NUTCH-460
 URL: https://issues.apache.org/jira/browse/NUTCH-460
 Project: Nutch
  Issue Type: New Feature
  Components: fetcher
Affects Versions: 0.9.0
Reporter: Ricardo J. Méndez
 Attachments: rubyspider-rdf.zip


 I've written a couple plugins that I'd like to contribute.  
 RDFLinkParseFilter looks for links on the pages that point towards RDF 
 information, and tags the pages with metadata about the type of links they 
 hold. RDFLinkIndexingFilter indexes said metadata.  RDFParser parses RDF 
 information from several possible formats using Jena, and extracts the links 
 that the file points to as Outlinks so that they can be fetched as well.




[jira] Updated: (NUTCH-460) RDF parser plugin

2010-03-31 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-460?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated NUTCH-460:


Patch Info: [Patch Available]

- pushing this out per http://bit.ly/c7tBv9

 RDF parser plugin
 -

 Key: NUTCH-460
 URL: https://issues.apache.org/jira/browse/NUTCH-460
 Project: Nutch
  Issue Type: New Feature
  Components: fetcher
Affects Versions: 0.9.0
Reporter: Ricardo J. Méndez
 Attachments: rubyspider-rdf.zip


 I've written a couple plugins that I'd like to contribute.  
 RDFLinkParseFilter looks for links on the pages that point towards RDF 
 information, and tags the pages with metadata about the type of links they 
 hold. RDFLinkIndexingFilter indexes said metadata.  RDFParser parses RDF 
 information from several possible formats using Jena, and extracts the links 
 that the file points to as Outlinks so that they can be fetched as well.




[jira] Updated: (NUTCH-729) NPE in FieldIndexer when BasicFields url doesn't exist

2010-03-31 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-729?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated NUTCH-729:


 Due Date: 26/Mar/09  (was: 26/Mar/09)
   Patch Info: [Patch Available]
Fix Version/s: (was: 1.1)

- pushing this out per http://bit.ly/c7tBv9

 NPE in FieldIndexer when BasicFields url doesn't exist
 --

 Key: NUTCH-729
 URL: https://issues.apache.org/jira/browse/NUTCH-729
 Project: Nutch
  Issue Type: Bug
  Components: indexer
Affects Versions: 0.9.0, 1.0.0
 Environment: All
Reporter: Dennis Kubes
Assignee: Dennis Kubes
 Attachments: NUTCH-729-1-20090235.patch


 There is a NullPointerException during a logging call in FieldIndexer when 
 there isn't a url for a document.  Documents shouldn't be without urls but 
 since the FieldIndexer doesn't validate fields it is possible for it to 
 occur.  Most often this happens when BasicFields is run with the wrong 
 segments directory and doesn't complain.  It could also occur if using the 
 FieldIndexer to index things other than basic fields.
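
A minimal sketch of the kind of validation described above, assuming the document's fields are available as a simple map. SafeFieldLogging and its describe method are hypothetical names; this is not the FieldIndexer code or the attached patch.

```java
import java.util.Map;

/** Hypothetical helper, not part of Nutch: builds a log message without assuming a url exists. */
final class SafeFieldLogging {
    private SafeFieldLogging() {}

    /** Returns a log line for the document; a missing url yields a warning text instead of an NPE. */
    static String describe(Map<String, String> fields) {
        String url = (fields == null) ? null : fields.get("url");
        if (url == null) {
            return "Skipping document without a url field";
        }
        return "Indexing " + url;
    }
}
```

Checking for the missing field before the logging call keeps the job running and makes the misconfiguration (for example, a wrong segments directory) visible in the log.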




[jira] Updated: (NUTCH-573) Multiple Domains - Query Search

2010-03-31 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-573?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated NUTCH-573:



- pushing this out per http://bit.ly/c7tBv9

 Multiple Domains - Query Search
 ---

 Key: NUTCH-573
 URL: https://issues.apache.org/jira/browse/NUTCH-573
 Project: Nutch
  Issue Type: Improvement
  Components: searcher
Affects Versions: 0.9.0
 Environment: All
Reporter: Rajasekar Karthik
Assignee: Enis Soztutar
 Attachments: multiTermQuery_v1.patch


 Searching multiple domains can be done in Lucene, but not that efficiently 
 in Nutch.
 Query:
 +content:abc +(site:www.aaa.com site:www.bbb.com)
 works in Lucene, but the same concept does not work in Nutch.
 In Lucene, it works with 
 org.apache.lucene.analysis.KeywordAnalyzer
 org.apache.lucene.analysis.standard.StandardAnalyzer 
 but NOT with
 org.apache.lucene.analysis.SimpleAnalyzer 
 Is the Nutch analyzer based on SimpleAnalyzer? If so, is there a 
 workaround to make this work? Is there an option to change which analyzer 
 Nutch uses? 
 Just FYI, another solution (inefficient, I believe) that seems to work 
 in Nutch:
 query -site:ccc.com -site:ddd.com 




[jira] Updated: (NUTCH-717) Make Nutch Solr integration easier

2010-03-31 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-717?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated NUTCH-717:


Fix Version/s: (was: 1.1)

- pushing this out per http://bit.ly/c7tBv9

 Make Nutch Solr integration easier
 --

 Key: NUTCH-717
 URL: https://issues.apache.org/jira/browse/NUTCH-717
 Project: Nutch
  Issue Type: New Feature
Reporter: Sami Siren

 Erik Hatcher proposed that we provide a full Solr config dir to be used 
 with Nutch-Solr. Currently we only provide the index schema. It would be considerably 
 easier to set up Nutch-Solr if we provided the whole conf dir, which you could 
 use with Solr like this:
 java -Dsolr.solr.home=<Nutch's Solr Home> -jar start.jar




[jira] Updated: (NUTCH-541) Index url field untokenized

2010-03-31 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-541?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated NUTCH-541:


Fix Version/s: (was: 1.1)

- pushing this out per http://bit.ly/c7tBv9

 Index url field untokenized
 ---

 Key: NUTCH-541
 URL: https://issues.apache.org/jira/browse/NUTCH-541
 Project: Nutch
  Issue Type: New Feature
  Components: indexer, searcher
Affects Versions: 1.0.0
Reporter: Enis Soztutar
Assignee: Enis Soztutar

 The url field is indexed as Store.YES, Index.TOKENIZED. We also need the 
 untokenized version of the url field in some contexts: 
 1. For deleting duplicates by url (at search time); see NUTCH-455.
 2. For restricting the search to a certain url (may be used in the case of 
 RSS search, where each entry in the RSS feed is added as a distinct document with 
 (possibly) the same url). 
 query-url extends FieldQueryFilter, so: 
 Query: url:http://www.apache.org/
 Parsed: url:http http-www http-www-apache www www-apache apache org
 Translated: +url:http-http-www http-www-http-www-apache 
 http-www-apache-www www-www-apache www-apache apache org
 3. For accessing document(s) in the search servers 
 (using a query plugin).
 I suggest we add url as in index-basic and implement a query-url-untoken 
 plugin. 
 doc.add(new Field("url", url.toString(), Field.Store.YES, 
 Field.Index.TOKENIZED));
 doc.add(new Field("url_untoken", url.toString(), Field.Store.NO, 
 Field.Index.UN_TOKENIZED));




[jira] Updated: (NUTCH-628) Host database to keep track of host-level information

2010-03-31 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-628?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated NUTCH-628:


   Patch Info: [Patch Available]
Fix Version/s: (was: 1.1)

- pushing this out per http://bit.ly/c7tBv9

 Host database to keep track of host-level information
 -

 Key: NUTCH-628
 URL: https://issues.apache.org/jira/browse/NUTCH-628
 Project: Nutch
  Issue Type: New Feature
  Components: fetcher, generator
Reporter: Otis Gospodnetic
 Attachments: domain_statistics_v2.patch, 
 NUTCH-628-DomainStatistics.patch, NUTCH-628-HostDb.patch


 Nutch would benefit from having a DB with per-host/domain/TLD information.  
 For instance, Nutch could detect hosts that are timing out and store information 
 about that in this DB.  Segment/fetchlist Generator could then skip such 
 hosts, so they don't slow down the fetch job.  Another good use for such a DB 
 is keeping track of various host scores, e.g. spam score.
 From the recent thread on nutch-u...@lucene:
 Otis asked:
  While we are at it, how would one go about implementing this DB, as far as 
  its structures go?
 Andrzej said:
 The easiest I can imagine is to use something like <Text, MapWritable>.
 This way you could store arbitrary information under arbitrary keys.
 I.e. a single database then could keep track of aggregate statistics at
 different levels, e.g. TLD, domain, host, ip range, etc. The basic set
 of statistics could consist of a few predefined gauges, totals and averages.
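
The structure Andrzej describes can be sketched as a map from an arbitrary key (TLD, domain, host, IP range) to a bag of named counters. The sketch below uses plain java.util maps as a stand-in for Hadoop's Text and MapWritable so that it is self-contained; HostStatsSketch, its methods, and the key prefixes are hypothetical and not taken from any of the attached patches.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

/** Hypothetical in-memory stand-in for a key -> (statistic name -> value) host database. */
final class HostStatsSketch {
    // Outer key: e.g. "tld:org", "domain:apache.org", "host:www.apache.org" (naming is an assumption).
    private final Map<String, Map<String, Long>> db = new HashMap<>();

    /** Adds delta to the named statistic under the given key, creating entries as needed. */
    void increment(String key, String statistic, long delta) {
        db.computeIfAbsent(key, k -> new HashMap<>()).merge(statistic, delta, Long::sum);
    }

    /** Returns the current value of the statistic, or 0 if nothing was recorded. */
    long get(String key, String statistic) {
        return db.getOrDefault(key, Collections.<String, Long>emptyMap())
                 .getOrDefault(statistic, 0L);
    }
}
```

With something like this, a generator could look up a per-host timeout count and skip hosts above a threshold, while the same store keeps totals at the domain or TLD level.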




[jira] Updated: (NUTCH-650) Hbase Integration

2010-03-31 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-650?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated NUTCH-650:


Fix Version/s: (was: 1.1)

- pushing this out per http://bit.ly/c7tBv9

 Hbase Integration
 -

 Key: NUTCH-650
 URL: https://issues.apache.org/jira/browse/NUTCH-650
 Project: Nutch
  Issue Type: New Feature
Affects Versions: 1.0.0
Reporter: Doğacan Güney
Assignee: Doğacan Güney
 Attachments: hbase-integration_v1.patch, hbase_v2.patch, 
 malformedurl.patch, meta.patch, meta2.patch, nofollow-hbase.patch, 
 NUTCH-650.patch, nutch-habase.patch, searching.diff, slash.patch


 This issue will track nutch/hbase integration




[jira] Updated: (NUTCH-583) FeedParser empty links for items

2010-03-31 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-583?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated NUTCH-583:


Fix Version/s: (was: 1.1)

- pushing this out per http://bit.ly/c7tBv9

 FeedParser empty links for items
 

 Key: NUTCH-583
 URL: https://issues.apache.org/jira/browse/NUTCH-583
 Project: Nutch
  Issue Type: Bug
Affects Versions: 1.0.0
Reporter: Enis Soztutar
Assignee: Enis Soztutar

 FeedParser in the feed plugin just discards an item if it does not have a link 
 element. However, RSS 2.0 does not require the link element for each 
 item. 
 Moreover, sometimes the link is given in the guid element, which is a 
 globally unique identifier for the item. I think we can search for the url of an 
 item first; then, if it is still not found, we can use the feed's url, but 
 merge all the parse texts into one Parse object. 
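
The lookup order proposed above (link, then guid, then the feed's own url) can be sketched as follows. FeedItemLinkSketch is a hypothetical helper and not the FeedParser code; in particular, accepting a guid only when it starts with http:// or https:// is an assumption made for this illustration.

```java
/** Hypothetical helper, not part of the feed plugin: picks a url for a feed item. */
final class FeedItemLinkSketch {
    private FeedItemLinkSketch() {}

    /** Prefers the item's link, then a url-like guid, and finally falls back to the feed's url. */
    static String resolveUrl(String link, String guid, String feedUrl) {
        if (isUsable(link)) {
            return link.trim();
        }
        if (isUsable(guid) && looksLikeUrl(guid.trim())) {
            return guid.trim();
        }
        return feedUrl;
    }

    private static boolean isUsable(String s) {
        return s != null && !s.trim().isEmpty();
    }

    // Assumption: only http(s) guids are treated as links.
    private static boolean looksLikeUrl(String s) {
        return s.startsWith("http://") || s.startsWith("https://");
    }
}
```

Items that end up with the feed's url would then share one document, which is why the description suggests merging their parse texts into a single Parse object.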




[jira] Updated: (NUTCH-666) Analysis plugins for multiple language and new Language Identifier Tool

2010-03-31 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-666?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated NUTCH-666:


 Due Date: 27/Nov/08  (was: 27/Nov/08)
Fix Version/s: (was: 1.1)

- pushing this out per http://bit.ly/c7tBv9

 Analysis plugins for multiple language and new Language Identifier Tool
 ---

 Key: NUTCH-666
 URL: https://issues.apache.org/jira/browse/NUTCH-666
 Project: Nutch
  Issue Type: Improvement
Affects Versions: 1.1
 Environment: All
Reporter: Dennis Kubes
Assignee: Dennis Kubes
 Attachments: NUTCH-666-1-20081126.patch, NUTCH-666-2-20091217-nf.patch


 Add analysis plugins for Czech, Greek, Japanese, Chinese, Korean, Dutch, 
 Russian, and Thai.  Also includes a new Language Identifier tool that uses 
 the new indexing framework in NUTCH-646.




[jira] Updated: (NUTCH-666) Analysis plugins for multiple language and new Language Identifier Tool

2010-03-31 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-666?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated NUTCH-666:


Patch Info: [Patch Available]

 Analysis plugins for multiple language and new Language Identifier Tool
 ---

 Key: NUTCH-666
 URL: https://issues.apache.org/jira/browse/NUTCH-666
 Project: Nutch
  Issue Type: Improvement
Affects Versions: 1.1
 Environment: All
Reporter: Dennis Kubes
Assignee: Dennis Kubes
 Attachments: NUTCH-666-1-20081126.patch, NUTCH-666-2-20091217-nf.patch


 Add analysis plugins for Czech, Greek, Japanese, Chinese, Korean, Dutch, 
 Russian, and Thai.  Also includes a new Language Identifier tool that uses 
 the new indexing framework in NUTCH-646.




[jira] Updated: (NUTCH-475) Adaptive crawl delay

2010-03-31 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-475?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated NUTCH-475:


Fix Version/s: (was: 1.1)

- pushing this out per http://bit.ly/c7tBv9

 Adaptive crawl delay
 

 Key: NUTCH-475
 URL: https://issues.apache.org/jira/browse/NUTCH-475
 Project: Nutch
  Issue Type: Improvement
  Components: fetcher
Reporter: Doğacan Güney
 Attachments: adaptive-delay_draft.patch


 The current fetcher implementation waits a default interval before making another 
 request to the same server (if crawl-delay is not specified in robots.txt). 
 IMHO, an adaptive implementation would be better. If the server is under 
 little load and can serve requests fast, then the fetcher can ask for more pages 
 in a given interval. Similarly, if the server is suffering from heavy load, 
 the fetcher can slow down (w.r.t. that host), easing the load on the server.
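
One possible policy for such an adaptive delay, sketched under stated assumptions: the next delay is proportional to the last observed response time and clamped to a configured range. AdaptiveDelaySketch, its parameters, and the proportional rule are hypothetical; the attached draft patch may work differently.

```java
/** Hypothetical helper, not the draft patch: derives a per-host delay from the last response time. */
final class AdaptiveDelaySketch {
    private AdaptiveDelaySketch() {}

    /**
     * Returns factor * lastResponseMillis, clamped to [minDelayMillis, maxDelayMillis].
     * A fast server gets a short delay; a slow (loaded) server gets a longer one.
     */
    static long nextDelayMillis(long lastResponseMillis, double factor,
                                long minDelayMillis, long maxDelayMillis) {
        if (lastResponseMillis < 0 || factor <= 0
                || minDelayMillis < 0 || maxDelayMillis < minDelayMillis) {
            throw new IllegalArgumentException("invalid delay parameters");
        }
        long proposed = Math.round(lastResponseMillis * factor);
        return Math.max(minDelayMillis, Math.min(maxDelayMillis, proposed));
    }
}
```

The lower bound keeps the fetcher polite even towards very fast servers, and the upper bound prevents a single slow response from stalling a host's queue indefinitely. An explicit crawl-delay from robots.txt would still take precedence, as in the current behaviour described above.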




[jira] Updated: (NUTCH-771) Add WebGraph classes to the bin/nutch script

2010-03-31 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-771?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated NUTCH-771:


Fix Version/s: (was: 1.1)

- pushing this out per http://bit.ly/c7tBv9

 Add WebGraph classes to the bin/nutch script
 

 Key: NUTCH-771
 URL: https://issues.apache.org/jira/browse/NUTCH-771
 Project: Nutch
  Issue Type: Improvement
Affects Versions: 1.1
 Environment: All, shell script
Reporter: Dennis Kubes
Assignee: Dennis Kubes

 Currently the webgraph jobs are called on the command line by calling main 
 methods on their classes.  I propose to upgrade the bin/nutch shell script to 
 allow calling these jobs as well.  This would include the webgraphdb, 
 linkrank, scoreupdater, and nodedumper jobs.




[jira] Commented: (NUTCH-673) Upgrade the Carrot2 plug-in to release 3.0

2010-03-31 Thread Chris A. Mattmann (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-673?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12852047#action_12852047
 ] 

Chris A. Mattmann commented on NUTCH-673:
-

Folks: if you get time to put together a patch for 1.1 or feel that this should 
go into 1.1, please see:  http://bit.ly/c7tBv9 and comment in the next 48 hrs...

 Upgrade the Carrot2 plug-in to release 3.0
 --

 Key: NUTCH-673
 URL: https://issues.apache.org/jira/browse/NUTCH-673
 Project: Nutch
  Issue Type: Improvement
  Components: web gui
Affects Versions: 0.9.0
 Environment: All Nutch deployments.
Reporter: Sean Dean
Priority: Minor

 Release 3.0 of the Carrot2 plug-in was released recently.
 We currently have version 2.1 in the source tree, and upgrading it to the 
 latest version before 1.0-release might make sense.
 Details on the release can be found here: 
 http://project.carrot2.org/release-3.0-notes.html
 One major change in requirements is for JDK 1.5 to be used, but this is also 
 now required for Hadoop 0.19, so this wouldn't be the only reason for the 
 switch.




[jira] Commented: (NUTCH-789) Improvements to Tika parser

2010-03-31 Thread Chris A. Mattmann (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-789?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12852048#action_12852048
 ] 

Chris A. Mattmann commented on NUTCH-789:
-

Folks, I'm going to put together an RC for Tika 0.7 and take care of JIRA now. 
Once I do that, we can try and close out this issue for 1.1. I should be able 
to do this before the 48 hr deadline I threw up for Nutch 1.1...

 Improvements to Tika parser
 ---

 Key: NUTCH-789
 URL: https://issues.apache.org/jira/browse/NUTCH-789
 Project: Nutch
  Issue Type: Improvement
  Components: fetcher
 Environment: reported by Sami, in NUTCH-766
Reporter: Chris A. Mattmann
Assignee: Chris A. Mattmann
Priority: Minor
 Fix For: 1.1

 Attachments: NutchTikaConfig.java, TikaParser.java


 As reported by Sami in NUTCH-766, Sami has a few improvements he made to the 
 Tika parser. We'll track that progress here.




[jira] Commented: (NUTCH-794) Language Identification must use check the parse metadata for language values

2010-03-31 Thread Chris A. Mattmann (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-794?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12852101#action_12852101
 ] 

Chris A. Mattmann commented on NUTCH-794:
-

Hey Julien, yepper, I posted an RC of Tika 0.7, see: http://bit.ly/c7FZRc. If 
the VOTE passes on that in say the next 72 hours, I will push out a Tika 0.7 
release to the mirrors. If everyone is OK with that, we can release Nutch 1.1 
after...thoughts?

 Language Identification must use check the parse metadata for language values 
 --

 Key: NUTCH-794
 URL: https://issues.apache.org/jira/browse/NUTCH-794
 Project: Nutch
  Issue Type: Bug
  Components: parser
Reporter: Julien Nioche
Assignee: Julien Nioche
 Fix For: 1.1

 Attachments: NUTCH-794.patch


 The following HTML document: 
 <html lang="fi"><head>document 1 title</head><body>jotain 
 suomeksi</body></html>
 is rendered as the following XHTML by Tika: 
 <?xml version="1.0" encoding="UTF-8"?><html 
 xmlns="http://www.w3.org/1999/xhtml"><head><title/></head><body>document 1 
 titlejotain suomeksi</body></html>
 with the lang attribute getting lost.  The lang is not stored in the metadata 
 either.
 I will open an issue on Tika and modify TestHTMLLanguageParser so that the 
 tests don't break anymore.
