Re: question about ObjectCache
On 10/04/2012 05:00, Xiaolong Yang wrote: Hi all, I'm reading the source code of Nutch and I'm puzzled about ObjectCache.java in the package org.apache.nutch.util. I find there may be little benefit to using it in the urlnormalizers and urlfilters. I have also read some discussion about caching in NUTCH-169 and NUTCH-501, but I can't understand it. Can anyone tell me where ObjectCache is used to good benefit in Nutch? ObjectCache is designed to cache ready-to-use instances of Nutch plugins. The process of finding, instantiating and initializing plugins is inefficient, because it involves parsing plugin descriptors, initializing plugins, collecting the ones that implement the correct extension points, etc. It would kill performance if this process were invoked each time you want to run all plugins of a given type (e.g. URLNormalizer-s). The facades URLNormalizers/URLFilters and others make sure that plugin instances of a given type are initialized once per lifetime of a JVM, and then they are cached in ObjectCache, so that the next time you want to use them they can be retrieved from the cache, instead of going again through the process of parsing/instantiating/initializing. -- Best regards, Andrzej Bialecki ___. ___ ___ ___ _ _ __ [__ || __|__/|__||\/| Information Retrieval, Semantic Web ___|||__|| \| || | Embedded Unix, System Integration http://www.sigram.com Contact: info at sigram dot com
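The idea described above can be sketched in plain Java. This is an illustrative once-per-JVM cache, not the actual org.apache.nutch.util.ObjectCache source; the class and method names are assumptions:

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative sketch of the caching idea behind ObjectCache:
// expensive-to-initialize plugin instances are built once per JVM
// and afterwards fetched by key from a map. Not the actual Nutch code.
public class SimpleObjectCache {
    private static final SimpleObjectCache CACHE = new SimpleObjectCache();
    private final Map<String, Object> objects = new HashMap<>();

    private SimpleObjectCache() {}

    public static SimpleObjectCache get() {
        return CACHE;
    }

    public synchronized Object getObject(String key) {
        return objects.get(key);
    }

    public synchronized void setObject(String key, Object value) {
        objects.put(key, value);
    }
}
```

A facade such as URLNormalizers would consult the cache first; only on a miss does it pay the descriptor-parsing and initialization cost, storing the result so later calls are cheap map lookups.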
Re: Drawing an analogy between AdaptiveFetchSchedule and AdaptiveCrawlDelay
On 02/03/2012 12:45, Lewis John Mcgibbney wrote: Hi Guys, As there were some comments on the user list, I recently got digging into HTTP redirects, then stumbled across NUTCH-1042. Although these are individual issues, e.g. redirects and crawl delays, I think they are certainly linked; however, what is interesting is that users 'usually' don't consider them to be interlinked as such and therefore struggle to debug how and why either the redirect or the crawl-delay pages are not being fetched. Doing some more digging I found the now rather old and tatty NUTCH-475, which obviously got me thinking about how we maintain the AdaptiveFetchSchedule for custom refetching. Now I'm starting to think about the following - Regardless of whether we implement an AdaptiveCrawlDelay, NUTCH-1042 still needs to be fixed, as this is obviously becoming a bit of a pain for some users. Yes. - Can someone shine some light on what happened to Fetcher2.java that Dogacan refers to? I was only ever accustomed to OldFetcher and Fetcher :0) Fetcher2 is the current Fetcher. The original Fetcher was temporarily renamed OldFetcher and then removed. -- Best regards, Andrzej Bialecki
[jira] [Commented] (NUTCH-1201) Allow for different FetcherThread impls
[ https://issues.apache.org/jira/browse/NUTCH-1201?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13187927#comment-13187927 ] Andrzej Bialecki commented on NUTCH-1201: -- I agree that there are situations where you might want a custom fetcher (e.g. depth-first crawling), and it would be good to come up with some more specific API than just MapRunner. I'm not convinced yet that providing interfaces (or rather abstract classes) for the existing plumbing in Fetcher is a good idea - let's figure out first whether this code is reusable at all for some other fetching strategies, because if it's not then providing custom queue impls. may offer little value, and perhaps customization should be implemented on a different level. Re. thread spinning - I haven't yet seen an unequivocal case that would prove that crawl contention is caused by the thread mgmt in Fetcher. Usually on closer look the bottleneck turned out to lie elsewhere (network io, remote throttling, dns lookups, politeness rules, etc). Allow for different FetcherThread impls --- Key: NUTCH-1201 URL: https://issues.apache.org/jira/browse/NUTCH-1201 Project: Nutch Issue Type: New Feature Components: fetcher Reporter: Markus Jelsma Assignee: Markus Jelsma Fix For: 1.5 For certain cases we need to modify parts of FetcherThread and make it pluggable. This introduces a new config directive fetcher.impl that takes a FQCN and uses that setting in Fetcher.fetch to load a class to use for job.setMapRunnerClass(). This new class has to extend Fetcher and its inner class FetcherThread. This allows for overriding methods in FetcherThread but also methods in Fetcher itself if required. A follow-up on this issue would be to refactor parts of FetcherThread to make it easier to override small sections instead of copying the entire method body for a small change, which is now the case. -- This message is automatically generated by JIRA. 
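The fetcher.impl mechanism described in the issue boils down to resolving a fully qualified class name from configuration, falling back to the default Fetcher when it cannot be loaded. A minimal sketch of that pattern (the class and method names below are illustrative, not taken from the actual patch):

```java
public class PluggableLoader {
    // Resolve an implementation class from a FQCN string, as a config
    // directive like fetcher.impl would supply, falling back to a
    // default class when the name is absent or cannot be loaded.
    public static Class<?> load(String fqcn, Class<?> fallback) {
        if (fqcn == null || fqcn.isEmpty()) {
            return fallback;
        }
        try {
            return Class.forName(fqcn);
        } catch (ClassNotFoundException e) {
            return fallback;
        }
    }
}
```

Fetcher.fetch would then hand the resolved class to job.setMapRunnerClass(), so the chosen subclass (and its FetcherThread) runs in place of the default.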
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (NUTCH-1247) CrawlDatum.retries should be int
[ https://issues.apache.org/jira/browse/NUTCH-1247?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13186212#comment-13186212 ] Andrzej Bialecki commented on NUTCH-1247: -- Indeed, line 264 increases the retry counter, but after it reaches retryMax the page status is set to DB_GONE, so it won't be generated again until it expires, and its retry counter won't increase. Once it expires, Generator should invoke FetchSchedule.forceRefetch on this page, and the default implementation resets the retry counter. So either there's some bug in this cycle, or your retryMax is greater than 127. CrawlDatum.retries should be int Key: NUTCH-1247 URL: https://issues.apache.org/jira/browse/NUTCH-1247 Project: Nutch Issue Type: Bug Affects Versions: 1.4 Reporter: Markus Jelsma Fix For: 1.5 CrawlDatum.retries is a byte and goes bad with larger values. 12/01/12 18:35:22 INFO crawl.CrawlDbReader: retry -127: 1 12/01/12 18:35:22 INFO crawl.CrawlDbReader: retry -128: 1
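The negative retry counts in the log are plain byte wrap-around: incrementing past Byte.MAX_VALUE (127) flips to -128. A minimal demonstration (the helper class is illustrative, not Nutch code):

```java
public class ByteOverflowDemo {
    // Simulates a retry counter stored as a byte, as CrawlDatum.retries
    // is: arithmetic past 127 wraps into negative territory.
    public static byte increment(byte retries) {
        return (byte) (retries + 1);
    }

    public static void main(String[] args) {
        byte retries = Byte.MAX_VALUE;          // 127
        System.out.println(increment(retries)); // prints -128
    }
}
```

This matches the reported 'retry -127' and 'retry -128' buckets, and is why the counter either needs a wider type (or vint encoding, as suggested below) or retryMax must stay well below 127.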
[jira] [Commented] (NUTCH-1247) CrawlDatum.retries should be int
[ https://issues.apache.org/jira/browse/NUTCH-1247?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13185908#comment-13185908 ] Andrzej Bialecki commented on NUTCH-1247: -- Originally the reason for a byte was compactness, but we can get the same effect using vint. Markus, something seems off in your setup if you get such high values of retries ... usually CrawlDbReducer will set STATUS_DB_GONE if the number of retries reaches db.fetch.retry.max, so the page will not be tried again until FetchSchedule.forceRefetch resets its status (and the number of retries). CrawlDatum.retries should be int Key: NUTCH-1247 URL: https://issues.apache.org/jira/browse/NUTCH-1247 Project: Nutch Issue Type: Bug Affects Versions: 1.4 Reporter: Markus Jelsma Fix For: 1.5 CrawlDatum.retries is a byte and goes bad with larger values. 12/01/12 18:35:22 INFO crawl.CrawlDbReader: retry -127: 1 12/01/12 18:35:22 INFO crawl.CrawlDbReader: retry -128: 1
Re: Build failed in Jenkins: Nutch-trunk #1706
On 28/12/2011 12:00, Lewis John Mcgibbney wrote: Hi Guys, Pretty strange compilation failure, this test class hasn't been hacked in months, and from the surface, having looked at the test case there appears to be no obvious reason for it failing to compile. I've kick-started another build on Jenkins to see if it will resolve itself. I don't think it will - I can reproduce this failure locally. Here's what fixed the failure for me (I'm pretty ignorant about ivy/maven so there's likely a more correct fix for this):

Index: ivy/ivy.xml
===================================================================
--- ivy/ivy.xml (revision 1225046)
+++ ivy/ivy.xml (working copy)
@@ -69,7 +69,7 @@
   <!--Configuration: test -->
   <!--artifacts needed for testing -->
-  <dependency org="junit" name="junit" rev="3.8.1" conf="test->default"/>
+  <dependency org="junit" name="junit" rev="3.8.1" conf="*->default"/>
   <dependency org="org.apache.hadoop" name="hadoop-test" rev="0.20.205.0" conf="test->default"/>

-- Best regards, Andrzej Bialecki
Re: Build failed in Jenkins: Nutch-trunk #1706
On 28/12/2011 14:15, Lewis John Mcgibbney wrote: Hi Andrzej, Can anyone confirm? I've tried this patch locally and although I couldn't reproduce the original issue, it seems to be working fine for me as well. Check your lib/ dir, maybe you have a local copy of the junit jar that gets pulled onto the classpath and masks the issue? This happened to me once or twice... -- Best regards, Andrzej Bialecki
Re: [jira] [Created] (NUTCH-1225) Migrate CrawlDBScanner to MapReduce API
On 15/12/2011 13:13, Markus Jelsma wrote: hmm, i don't see how i can use the old mapred MapOutputFormat API with the new Job API. job.setOutputFormatClass(MapFileOutputFormat.class) expects the mapreduce.lib.output.MapFileOutputFormat class and won't accept the old API. setOutputFormatClass(java.lang.Class<? extends org.apache.hadoop.mapreduce.OutputFormat>) in org.apache.hadoop.mapreduce.Job cannot be applied to (java.lang.Class<org.apache.hadoop.mapred.MapFileOutputFormat>) In short, i don't know how i can migrate jobs to the new API on 0.20.x without having MapFileOutputFormat present in the new API. Trying to set the old mapoutputformat... Ah, no, that's not what I meant ... of course you need to change the code to use the new api, and the new code will look quite different :) my point was only that it is different in a consistent way, so after you've ported one or two classes the other ones are easy to convert, too... I'm bogged down with other work now, but I'll see if I can prepare an example later today... -- Best regards, Andrzej Bialecki
Re: [jira] [Created] (NUTCH-1225) Migrate CrawlDBScanner to MapReduce API
On 14/12/2011 16:01, Markus Jelsma wrote: This is highly annoying, MapFileOutputFormat is not present in the MapReduce API until 0.21! AFAIK that's not the case ... there is both an old api and a new api implementation (the old one is deprecated). The new api is in org.apache.hadoop.mapreduce.lib.output . -- Best regards, Andrzej Bialecki
Re: [jira] [Created] (NUTCH-1225) Migrate CrawlDBScanner to MapReduce API
On 14/12/2011 18:30, Markus Jelsma wrote: proper link: http://hadoop.apache.org/common/docs/r0.20.205.0/api/org/apache/hadoop/mapreduce/lib/output/package-summary.html I thought the goal was to upgrade to 0.22, where this class is present. In 0.20.205 org.apache.hadoop.mapred.MapFileOutputFormat still uses the old api, and it's not deprecated yet. -- Best regards, Andrzej Bialecki
Re: Upgrading to Hadoop 0.22.0+
On 13/12/2011 17:42, Lewis John Mcgibbney wrote: Hi Markus, I'm certainly in agreement here. If you'd like to open a Jira, we can begin to build up a picture of what is required. Lewis On Tue, Dec 13, 2011 at 4:41 PM, Markus Jelsma markus.jel...@openindex.io wrote: Hi, To keep up with the rest of the world i believe we should move from the old Hadoop mapred API to the new MapReduce API, which has already been done for the nutchgora branch. Upgrading from hadoop-core to hadoop-common is easily done in Ivy but all jobs must be tackled and we have many jobs! Anyone to give pointers and a helping hand in this large task? I guess the question is also whether 0.22 is compatible enough to compile more or less with the existing code that uses the old api. If it does, then we can do the transition gradually; if it doesn't, then it's a bigger issue. This is easy to verify - just drop in the 0.22 jars and see if it compiles / tests are passing. -- Best regards, Andrzej Bialecki
Re: Upgrading to Hadoop 0.22.0+
On 13/12/2011 18:04, Markus Jelsma wrote: Hi I did a quick test to see what happens and it won't compile. It cannot find our old mapred APIs in 0.22. I've also tried 0.20.205.0, which compiles but won't run, and many tests fail with stuff like:

Exception in thread main java.lang.NoClassDefFoundError: org/codehaus/jackson/map/JsonMappingException
    at org.apache.nutch.util.dupedb.HostDeduplicator.deduplicator(HostDeduplicator.java:421)

Hmm... what's that? I don't see this class (or this package) in the Nutch tree. Also, trunk doesn't use JSON for anything as far as I know.

    at org.apache.nutch.util.dupedb.HostDeduplicator.run(HostDeduplicator.java:443)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
    at org.apache.nutch.util.dupedb.HostDeduplicator.main(HostDeduplicator.java:431)
Caused by: java.lang.ClassNotFoundException: org.codehaus.jackson.map.JsonMappingException
    at java.net.URLClassLoader$1.run(URLClassLoader.java:202)
    at java.security.AccessController.doPrivileged(Native Method)
    at java.net.URLClassLoader.findClass(URLClassLoader.java:190)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:306)
    at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:301)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:247)
    ... 4 more

I think this can be overcome but we cannot hide from the fact that all jobs must be ported to the new API at some point. You did some work on the new APIs, did you come across any cumbersome issues when working on it? It was quite some time ago .. but I don't remember anything being really complicated, it was just tedious - and once you've done one class the other classes follow roughly the same pattern. -- Best regards, Andrzej Bialecki
[jira] [Resolved] (NUTCH-1213) Pass additional SolrParams when indexing to Solr
[ https://issues.apache.org/jira/browse/NUTCH-1213?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrzej Bialecki resolved NUTCH-1213. -- Resolution: Fixed Committed in rev. 1207217, thanks for the review. Pass additional SolrParams when indexing to Solr Key: NUTCH-1213 URL: https://issues.apache.org/jira/browse/NUTCH-1213 Project: Nutch Issue Type: Improvement Components: indexer Reporter: Andrzej Bialecki Assignee: Andrzej Bialecki Attachments: NUTCH-1213.diff This is a simple improvement of the SolrIndexer. It adds the ability to pass additional Solr parameters that are applied to each UpdateRequest. This is useful when you have to pass parameters specific to a particular indexing run, which are not in Solr invariants for the update handler, and modifying the Solr configuration for each different indexing run is inconvenient.
[jira] [Updated] (NUTCH-1213) Pass additional SolrParams when indexing to Solr
[ https://issues.apache.org/jira/browse/NUTCH-1213?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrzej Bialecki updated NUTCH-1213: - Attachment: NUTCH-1213.diff Patch that implements this functionality. SolrParams can be passed as an URL-like string, for example: {code} nutch solrindex http://localhost:8983/solr/collection1 db -linkdb linkdb -params update.chain=distrib&fmap.a=links segments/2025105233 {code} Pass additional SolrParams when indexing to Solr Key: NUTCH-1213 URL: https://issues.apache.org/jira/browse/NUTCH-1213 Project: Nutch Issue Type: Improvement Components: indexer Reporter: Andrzej Bialecki Assignee: Andrzej Bialecki Attachments: NUTCH-1213.diff This is a simple improvement of the SolrIndexer. It adds the ability to pass additional Solr parameters that are applied to each UpdateRequest. This is useful when you have to pass parameters specific to a particular indexing run, which are not in Solr invariants for the update handler, and modifying the Solr configuration for each different indexing run is inconvenient.
[jira] [Issue Comment Edited] (NUTCH-1213) Pass additional SolrParams when indexing to Solr
[ https://issues.apache.org/jira/browse/NUTCH-1213?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13157077#comment-13157077 ] Andrzej Bialecki edited comment on NUTCH-1213 at 11/25/11 10:26 AM: - Patch that implements this functionality. SolrParams can be passed as an URL-like string, for example: {code} nutch solrindex http://localhost:8983/solr/collection1 db -linkdb linkdb -params update.chain=distrib&fmap.a=links segments/2025105233 {code} was (Author: ab): Path that implements this functionality. SolrParams can be passed as an URL-like string, for example: {code} nutch solrindex http://localhost:8983/solr/collection1 db -linkdb linkdb -params update.chain=distrib&fmap.a=links segments/2025105233 {code} Pass additional SolrParams when indexing to Solr Key: NUTCH-1213 URL: https://issues.apache.org/jira/browse/NUTCH-1213 Project: Nutch Issue Type: Improvement Components: indexer Reporter: Andrzej Bialecki Assignee: Andrzej Bialecki Attachments: NUTCH-1213.diff This is a simple improvement of the SolrIndexer. It adds the ability to pass additional Solr parameters that are applied to each UpdateRequest. This is useful when you have to pass parameters specific to a particular indexing run, which are not in Solr invariants for the update handler, and modifying the Solr configuration for each different indexing run is inconvenient.
Re: Dependency Injection
On 23/11/2011 01:02, Andrzej Bialecki wrote: On 22/11/2011 19:47, PJ Herring wrote: Hey Chris, Thanks for the response. I looked at the documents you sent me, and I really do think incorporating some kind of DI Framework could be a great addition to Nutch. I have a general plan of attack, but I'll try to write that up more formally and send it out to get some kind of feedback. This sounds interesting. As Chris mentioned, the current plugin system is far from ideal, but so far it has worked reasonably well. The key functionality that it implements is: * self-discovery of services provided by each plugin, * easy pluggability, by virtue of dropping super-jars (jars with impl. classes and nested library jars) into a predefined location, * controlled classloader isolation between plugins so that incompatible versions of libraries can be used, * but also the ability to export specified classes and libraries so that one plugin can use another plugin's exported resources on its classpath, * optional auto-loading of dependent plugins. In the past one contributor made a bold attempt to port Nutch to OSGI, and it turned out to be much more complicated than we expected, and with a bigger impact on the way Nutch applications were supposed to run ... so at that time we didn't think this complication was justified. If we can figure out something between full-blown OSGI and the current system then that would be great. You may also want to take a look at JSPF (http://code.google.com/p/jspf) which perhaps could be made to satisfy the above requirements without too much refactoring. -- Best regards, Andrzej Bialecki
Re: Dependency Injection
On 22/11/2011 19:47, PJ Herring wrote: Hey Chris, Thanks for the response. I looked at the documents you sent me, and I really do think incorporating some kind of DI Framework could be a great addition to Nutch. I have a general plan of attack, but I'll try to write that up more formally and send it out to get some kind of feedback. This sounds interesting. As Chris mentioned, the current plugin system is far from ideal, but so far it has worked reasonably well. The key functionality that it implements is: * self-discovery of services provided by each plugin, * easy pluggability, by virtue of dropping super-jars (jars with impl. classes and nested library jars) into a predefined location, * controlled classloader isolation between plugins so that incompatible versions of libraries can be used, * but also the ability to export specified classes and libraries so that one plugin can use another plugin's exported resources on its classpath, * optional auto-loading of dependent plugins. In the past one contributor made a bold attempt to port Nutch to OSGI, and it turned out to be much more complicated than we expected, and with a bigger impact on the way Nutch applications were supposed to run ... so at that time we didn't think this complication was justified. If we can figure out something between full-blown OSGI and the current system then that would be great. -- Best regards, Andrzej Bialecki
Re: Signature == null ?
On 15/11/2011 20:33, Markus Jelsma wrote: It's back again! Last try if someone has a pointer for this. Cheers After some DB updates, they're gone! Does anyone recognize this phenomenon? On Tuesday 08 November 2011 11:22:48 Markus Jelsma wrote: On Tuesday 08 November 2011 11:15:37 Markus Jelsma wrote: Hi guys, I have a M/R job selecting only DB_FETCHED and DB_NOTMODIFIED records and their signatures. I had to add a sanity check on signature to avoid an NPE. I had the assumption any record with such a DB_ status has to have a signature, right? Why does roughly 0.0001625% of my records exist without a signature? Now with correct metrics: Why does roughly 0.84% of my records exist without a signature? This could be somehow related to pages that come from redirects, so that when they are fetched they are accounted for under different urls, which in turn may confuse the update code in CrawlDbReducer... Do you notice any pattern to these pages? What's their origin? -- Best regards, Andrzej Bialecki
Re: Persistent problems with Ivy dependencies in Eclipse
On 10/11/2011 04:39, Lewis John Mcgibbney wrote: It gets even more strange, both SWFParser and AutomationURLFilter import additional dependencies, however they are not included within their plugin/ivy/ivy.xml files! Am I missing something here? Most likely these problems come from the initial porting of a pure ant build to an ant+ivy build. We should determine what deps are really needed by these plugins, and sanitize the ivy.xml files so that they make sense - if the existing files can't be untangled we can ditch them and come up with new, clean ones. -- Best regards, Andrzej Bialecki
[jira] [Commented] (NUTCH-1139) Indexer to delete documents
[ https://issues.apache.org/jira/browse/NUTCH-1139?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13147722#comment-13147722 ] Andrzej Bialecki commented on NUTCH-1139: -- I suggest renaming the option to -deleteGone, to make it more obvious what it's supposed to do. Indexer to delete documents --- Key: NUTCH-1139 URL: https://issues.apache.org/jira/browse/NUTCH-1139 Project: Nutch Issue Type: Improvement Components: indexer Reporter: Markus Jelsma Assignee: Markus Jelsma Fix For: 1.5 Attachments: NUTCH-1139-1.4-1.patch Add an option -delete to the solrindex command. With this feature enabled, documents of the currently processing segment with status FETCH_GONE or FETCH_REDIR_PERM are deleted; a following SolrClean is not required anymore. This issue is a follow-up of NUTCH-1052.
[jira] [Commented] (NUTCH-1061) Migrate MoreIndexingFilter from Apache ORO to java.util.regex
[ https://issues.apache.org/jira/browse/NUTCH-1061?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13147723#comment-13147723 ] Andrzej Bialecki commented on NUTCH-1061: -- +1. Migrate MoreIndexingFilter from Apache ORO to java.util.regex - Key: NUTCH-1061 URL: https://issues.apache.org/jira/browse/NUTCH-1061 Project: Nutch Issue Type: Improvement Reporter: Markus Jelsma Assignee: Markus Jelsma Fix For: 1.5 Attachments: NUTCH-1061-1.4-1.patch Here's a patch migrating the resetTitle method from Apache ORO to java.util.regex. There was no unit test for this method so I added one. The test passes with the old Apache ORO impl. and with the new j.u.regex impl. Please comment.
Re: Nutch Maven artifacts now published as polled/nightly SNAPSHOTS
On 05/11/2011 06:44, Mattmann, Chris A (388J) wrote: Hey Guys, I modified the Jenkins jobs that Lewis set up to now: * poll SCM hourly for changes to Nutch * publish Maven snapshots (1.5-SNAPSHOT) and above of Nutch to repository.apache.org Very useful - thanks a lot! -- Best regards, Andrzej Bialecki
[jira] [Commented] (NUTCH-1196) Update job should impose an upper limit on the number of inlinks (nutchgora)
[ https://issues.apache.org/jira/browse/NUTCH-1196?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13144226#comment-13144226 ] Andrzej Bialecki commented on NUTCH-1196: -- Very nicely done and useful patch! A few cosmetic comments: * a common pattern in Hadoop is to reuse object instances as much as possible, so any places where you use the new operator should be reviewed (e.g. new NutchWritable(...)). * in UrlScoreComparator.compare(o1, o2) you can just use unary minus instead of multiplication by -1. * in DbUpdateMapper you can assign a score of Float.MAX_VALUE to the web page record, this way in DbUpdateReducer.reduce you won't have to iterate over all entries, because the web page record will always come first, and we can save some time by skipping the remaining entries. Unless you really want to tally the number of skipped inlinks. Overall the patch looks good, +1 for committing. Update job should impose an upper limit on the number of inlinks (nutchgora) Key: NUTCH-1196 URL: https://issues.apache.org/jira/browse/NUTCH-1196 Project: Nutch Issue Type: Bug Reporter: Ferdy Galema Fix For: nutchgora Attachments: NUTCH-1196.patch Currently the nutchgora branch does not limit the number of inlinks in the update job. This will result in some nasty out-of-memory exceptions and timeouts when the crawl is getting big. Nutch trunk already has a default limit of 10,000 inlinks. I will implement this in nutchgora too. Nutch trunk uses a sorting mechanism in the reducer itself, but I will implement it using standard Hadoop components instead (should be a bit faster). This means: The keys of the reducer will be a {url,score} tuple. *Partitioning* will be done by {url}. *Sorting* will be done by {url,score}. Finally *grouping* will be done by {url} again. This ensures all identical urls will be put in the same reducer, but in order of scoring. Patch should be ready by tomorrow. 
Please let me know when you have any comments or suggestions.
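The partition/sort/group plan described in the issue can be illustrated with plain-Java comparators. This is an illustrative sketch, not the actual Hadoop RawComparator implementations from the patch; the class names are made up:

```java
import java.util.Comparator;

// Sketch of the secondary-sort key plan: the reduce key is a
// {url, score} tuple. Partitioning and grouping use only the url, while
// sorting uses url ascending then score descending, so the entry with
// the highest score for a given url is delivered to the reducer first.
public class UrlScoreSort {
    static class Key {
        final String url;
        final float score;
        Key(String url, float score) { this.url = url; this.score = score; }
    }

    // Sort comparator: url ascending, then score descending.
    static final Comparator<Key> SORT = (a, b) -> {
        int c = a.url.compareTo(b.url);
        if (c != 0) return c;
        return Float.compare(b.score, a.score); // higher score first
    };

    // Group comparator: all keys with the same url reach one reduce call.
    static final Comparator<Key> GROUP = (a, b) -> a.url.compareTo(b.url);
}
```

With this ordering, assigning the web page record a score of Float.MAX_VALUE (as suggested in the review comment above) guarantees it sorts ahead of every inlink entry for the same url, so the reducer can stop early once the inlink limit is reached.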
[jira] [Resolved] (NUTCH-1195) Add Solr 4x (trunk) example schema
[ https://issues.apache.org/jira/browse/NUTCH-1195?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrzej Bialecki resolved NUTCH-1195. -- Resolution: Fixed Committed in rev. 1197319. Add Solr 4x (trunk) example schema -- Key: NUTCH-1195 URL: https://issues.apache.org/jira/browse/NUTCH-1195 Project: Nutch Issue Type: Improvement Components: indexer Reporter: Andrzej Bialecki Assignee: Andrzej Bialecki Fix For: 1.4 Attachments: schema-solr4.xml The conf/schema.xml that we ship works ok for Solr 3.x, but in Solr trunk some of the class names have been changed, and some field types have been redefined, so if you simply drop this schema into Solr it will cause severe errors and indexing won't work. I propose to add a version of the schema.xml file that is tailored to Solr 4.x so that users can deploy this schema when indexing to Solr trunk.
[jira] [Created] (NUTCH-1197) Add statically configured field values to solrindex-mapping.xml
Add statically configured field values to solrindex-mapping.xml --- Key: NUTCH-1197 URL: https://issues.apache.org/jira/browse/NUTCH-1197 Project: Nutch Issue Type: Improvement Components: indexer Reporter: Andrzej Bialecki Assignee: Andrzej Bialecki Fix For: 1.4 In some cases it's useful to be able to add to every document sent to Solr a set of predefined fields with static values. This could be implemented on the Solr side (with a custom UpdateRequestProcessor), but it may be less cumbersome to add them on the Nutch side. Example: let's say I have several Nutch configurations all indexing to the same Solr instance, and I want each of them to add its identifier as a field in all documents, e.g. origin=web_crawl_1, origin=file_crawl, origin=unlimited_crawl, etc...
[jira] [Updated] (NUTCH-1197) Add statically configured field values to solrindex-mapping.xml
[ https://issues.apache.org/jira/browse/NUTCH-1197?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrzej Bialecki updated NUTCH-1197: - Attachment: NUTCH-1197.patch Patch with the implementation. I added some javadocs, and a unit test for both the old and the new functionality. Add statically configured field values to solrindex-mapping.xml --- Key: NUTCH-1197 URL: https://issues.apache.org/jira/browse/NUTCH-1197 Project: Nutch Issue Type: Improvement Components: indexer Reporter: Andrzej Bialecki Assignee: Andrzej Bialecki Fix For: 1.4 Attachments: NUTCH-1197.patch
[jira] [Created] (NUTCH-1195) Add Solr 4x (trunk) example schema
Add Solr 4x (trunk) example schema -- Key: NUTCH-1195 URL: https://issues.apache.org/jira/browse/NUTCH-1195 Project: Nutch Issue Type: Improvement Components: indexer Reporter: Andrzej Bialecki Assignee: Andrzej Bialecki Fix For: 1.4 The conf/schema.xml that we ship works ok for Solr 3.x, but in Solr trunk some of the class names have been changed, and some field types have been redefined, so if you simply drop this schema into Solr it will cause severe errors and indexing won't work. I propose to add a version of the schema.xml file that is tailored to Solr 4.x so that users can deploy this schema when indexing to Solr trunk. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (NUTCH-1195) Add Solr 4x (trunk) example schema
[ https://issues.apache.org/jira/browse/NUTCH-1195?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrzej Bialecki updated NUTCH-1195: - Attachment: schema-solr4.xml Add Solr 4x (trunk) example schema -- Key: NUTCH-1195 URL: https://issues.apache.org/jira/browse/NUTCH-1195 Project: Nutch Issue Type: Improvement Components: indexer Reporter: Andrzej Bialecki Assignee: Andrzej Bialecki Fix For: 1.4 Attachments: schema-solr4.xml
[jira] [Commented] (NUTCH-1135) Fix TestGoraStorage for Nutchgora
[ https://issues.apache.org/jira/browse/NUTCH-1135?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13127427#comment-13127427 ] Andrzej Bialecki commented on NUTCH-1135: -- A few comments from the author of this monstrosity ;) First, thanks Ferdy for taking time to work with this, it's much appreciated, we need to move forward on this. I agree that ultimately this test should be moved to Gora and become a part of a larger test suite that verifies correctness of concurrent multi-threaded and multi-process operations. However, the immediate purpose of this class was to stress-test the existing Gora versions in usage patterns typical for Nutch, in order to verify that a particular version of Gora is a viable storage layer for Nutch - so the test tries to replicate typical Nutch scenarios. Remember that this has to work not only for a toy crawl in a single JVM in local mode, but also for a fully distributed parallel map-reduce crawl. Consequently: * testMultiThread: tests a scenario of multiple threads in a single JVM all writing to the same storage instance. This replicates a scenario present e.g. in a single Fetcher task. If this test fails (assuming it's properly constructed!) then this means that Gora will fail, perhaps silently (see NUTCH-893), in a fundamental Nutch tool. * testMultiProcess: tests a scenario of multiple processes running in multiple JVMs all writing to the same storage instance. This replicates a scenario of multiple map-reduce tasks all using the same storage config (shared storage, e.g. HSQLDB in server mode), and it's fundamental to all Nutch tools running on a cluster. In map-reduce jobs there are usually many concurrent tasks, and some of them may execute in several copies in parallel (speculative execution) and some others may fail catastrophically without proper cleanup - and Gora backends must just deal with it. 
If this test fails (again, assuming it's properly constructed and doesn't exceed some OS capabilities of the test machine, or some known limits of a storage impl. like the number of concurrent connections) then it means that Gora storage is not reliable for a typical map-reduce usage, which sort of defeats the point of using it at all. To summarize: I think the patch in its current form helps the tests pass, but I don't think it addresses the underlying problems in Gora (or perhaps the problems with HSQL backend), rather it hides the problem. After all, we want the test to mean something if it passes, to verify that we can use Gora for more than a toy crawl, with guarantees of correctness in presence of concurrent updates. If the above errors don't indicate issues with Gora, but instead are caused by exceeded OS or hsql limits, or hsql misconfiguration, then of course we should fix the configs and adjust the numbers so that they make sense. But with the proper config and proper numbers both tests should pass, otherwise we can't be sure that Gora is working properly at all. Fix TestGoraStorage for Nutchgora - Key: NUTCH-1135 URL: https://issues.apache.org/jira/browse/NUTCH-1135 Project: Nutch Issue Type: Sub-task Components: storage Affects Versions: nutchgora Reporter: Lewis John McGibbney Assignee: Lewis John McGibbney Priority: Critical Fix For: nutchgora Attachments: NUTCH-1135-v1.patch, NUTCH-1135-v2.patch This issue is part of a larger target which aims to fix broken JUnit tests for Nutchgora -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
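The testMultiThread scenario described above can be sketched generically: N threads in one JVM all writing to a single shared store, followed by a check that every write survived. This is a stand-in using a ConcurrentHashMap as the "store", not the Gora DataStore API; a silent write loss (cf. NUTCH-893) would show up as a missing key.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Generic sketch of a multi-threaded single-storage stress test
// (stand-in map store, NOT the Gora API).
public class MultiThreadStoreTest {
  static final int THREADS = 8, KEYS_PER_THREAD = 1000;

  public static int run() throws Exception {
    ConcurrentHashMap<String, String> store = new ConcurrentHashMap<>();
    ExecutorService pool = Executors.newFixedThreadPool(THREADS);
    for (int t = 0; t < THREADS; t++) {
      final int id = t;
      pool.submit(() -> {
        // each thread writes a disjoint key range into the SAME store instance
        for (int k = 0; k < KEYS_PER_THREAD; k++)
          store.put("row-" + id + "-" + k, "val-" + k);
      });
    }
    pool.shutdown();
    pool.awaitTermination(30, TimeUnit.SECONDS);
    return store.size();  // should equal THREADS * KEYS_PER_THREAD if nothing was lost
  }

  public static void main(String[] args) throws Exception {
    System.out.println(run());
  }
}
```

The multi-process variant follows the same shape, except each "thread" becomes a separate JVM sharing one backend (e.g. HSQLDB in server mode), which is why backend connection limits matter there.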
[jira] [Commented] (NUTCH-1135) Fix TestGoraStorage for Nutchgora
[ https://issues.apache.org/jira/browse/NUTCH-1135?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13127470#comment-13127470 ] Andrzej Bialecki commented on NUTCH-1135: -- bq. if you prefer to keep the old TestGoraStorage structure Not really, I'm not against cleanup / breaking it up - if it makes sense let's go for it. My main concern was that by skipping the multi-process test altogether we would ignore testing a part of Gora functionality that is critical to Nutch (well, to any other map-reduce app, too, but we're doing Nutch here ;) ). Thank you for your persistence. bq. By the way, I tested the testMultithreaded with a DataStore that is not thread safe Excellent! Fix TestGoraStorage for Nutchgora - Key: NUTCH-1135 URL: https://issues.apache.org/jira/browse/NUTCH-1135 Project: Nutch Issue Type: Sub-task Components: storage Affects Versions: nutchgora Reporter: Lewis John McGibbney Assignee: Lewis John McGibbney Priority: Critical Fix For: nutchgora Attachments: NUTCH-1135-v1.patch, NUTCH-1135-v2.patch
[jira] [Commented] (NUTCH-797) parse-tika is not properly constructing URLs when the target begins with a ?
[ https://issues.apache.org/jira/browse/NUTCH-797?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13125712#comment-13125712 ] Andrzej Bialecki commented on NUTCH-797: - That's unexpected :) I checked the patch and I can't see where the bug could be ... Did you make sure that your config is correct, and that the job actually sees the right value of this property in the config (check the job.xml via JobTracker)? TestDOMContentUtils indicates that it should work, so we need to make sure that the flag has the correct value. parse-tika is not properly constructing URLs when the target begins with a ? -- Key: NUTCH-797 URL: https://issues.apache.org/jira/browse/NUTCH-797 Project: Nutch Issue Type: Bug Components: parser Affects Versions: 1.1 Environment: Win 7, Java(TM) SE Runtime Environment (build 1.6.0_16-b01) Also repro's on RHEL and java 1.4.2 Reporter: Robert Hohman Assignee: Andrzej Bialecki Priority: Minor Fix For: nutchgora Attachments: NUTCH-797.patch, pureQueryUrl-2.patch, pureQueryUrl.patch This is my first bug and patch on nutch, so apologies if I have not provided enough detail. In crawling the page at http://careers3.accenture.com/Careers/ASPX/Search.aspx?co=0&sk=0 there are links in the page that look like this: <a href="?co=0&sk=0&p=2&pi=1">2</a></td><td><a href="?co=0&sk=0&p=3&pi=1">3</a> In org.apache.nutch.parse.tika.DOMContentUtils rev 916362 (trunk), as getOutlinks looks for links, it comes across this link, and constructs a new url with a base URL class built from "http://careers3.accenture.com/Careers/ASPX/Search.aspx?co=0&sk=0", and a target of ?co=0&sk=0&p=2&pi=1. The URL class, per RFC 3986 at http://labs.apache.org/webarch/uri/rfc/rfc3986.html#relative-merge, defines how to merge these two, and per the RFC, the URL class merges these to http://careers3.accenture.com/Careers/ASPX/?co=0&sk=0&p=2&pi=1, because the RFC explicitly states that the rightmost url segment (the Search.aspx in this case) should be ripped off before combining.
While this is compliant with the RFC, it means the URLs which are created for the next round of fetching are incorrect. Modern browsers seem to handle this case (I checked IE8 and Firefox 3.5), so I'm guessing this is an obscure exception or handling of what is a poorly formed url on accenture's part. I have fixed this by modifying DOMContentUtils to look for the case where a ? begins the target, and then pulling the rightmost component out of the base and inserting it into the target before the ?, so the target in this example becomes Search.aspx?co=0&sk=0&p=2&pi=1. The URL class then properly constructs the new url as http://careers3.accenture.com/Careers/ASPX/Search.aspx?co=0&sk=0&p=2&pi=1. If it is agreed that this solution works, I believe the other html parsers in nutch would need to be modified in a similar way. Can I get feedback on this proposed solution? Specifically I'm worried about unforeseen side effects. Much thanks. Here is the patch info: Index: src/plugin/parse-tika/src/java/org/apache/nutch/parse/tika/DOMContentUtils.java === --- src/plugin/parse-tika/src/java/org/apache/nutch/parse/tika/DOMContentUtils.java (revision 916362) +++ src/plugin/parse-tika/src/java/org/apache/nutch/parse/tika/DOMContentUtils.java (working copy) @@ -299,6 +299,50 @@ return false; } + private URL fixURL(URL base, String target) throws MalformedURLException + { + // handle params that are embedded into the base url - move them to target + // so URL class constructs the new url class properly + if (base.toString().indexOf(';') > 0) + return fixEmbeddedParams(base, target); + + // handle the case that there is a target that is a pure query. 
+ // Strictly speaking this is a violation of RFC 2396 section 5.2.2 on how to assemble + // URLs but I've seen this in numerous places, for example at + // http://careers3.accenture.com/Careers/ASPX/Search.aspx?co=0&sk=0 + // It has urls in the page of the form href="?co=0&sk=0&pg=1", and by default + // URL constructs the base+target combo as + // http://careers3.accenture.com/Careers/ASPX/?co=0&sk=0&pg=1, incorrectly + // dropping the Search.aspx target + // + // Browsers handle these just fine, they must have an exception similar to this + if (target.startsWith("?")) + { + return fixPureQueryTargets(base, target); + } + + return new URL(base, target); + } + + private URL fixPureQueryTargets(URL base, String target
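The fixPureQueryTargets method is truncated in the quoted patch; its idea can be sketched as plain string manipulation (a reconstruction of the approach described in the report, not the exact patch code): take the rightmost path segment of the base URL and prepend it to the query-only target before resolving.

```java
import java.net.MalformedURLException;
import java.net.URL;

// Sketch of the pure-query fix: re-insert the rightmost path segment of the
// base ("Search.aspx") so java.net.URL does not drop it when resolving a
// query-only relative reference like "?p=2".
public class PureQueryFix {
  static URL fixPureQueryTarget(URL base, String target) throws MalformedURLException {
    if (!target.startsWith("?"))
      return new URL(base, target);
    String basePath = base.getPath();              // e.g. /Careers/ASPX/Search.aspx
    int idx = basePath.lastIndexOf('/');
    String rightMost = (idx >= 0) ? basePath.substring(idx + 1) : basePath;
    return new URL(base, rightMost + target);      // Search.aspx?co=0&sk=0&p=2&pi=1
  }

  public static void main(String[] args) throws Exception {
    URL base = new URL("http://careers3.accenture.com/Careers/ASPX/Search.aspx?co=0&sk=0");
    System.out.println(fixPureQueryTarget(base, "?co=0&sk=0&p=2&pi=1"));
  }
}
```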
Re: [jira] [Commented] (NUTCH-797) parse-tika is not properly constructing URLs when the target begins with a ?
On 12/10/2011 13:17, Markus Jelsma (Commented) (JIRA) wrote: [ https://issues.apache.org/jira/browse/NUTCH-797?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13125717#comment-13125717 ] Markus Jelsma commented on NUTCH-797: - This test was on a local instance. I tried both values for parser.fix.embeddedparams with: $ bin/nutch parsechecker http://www.funkybabes.nl/;ROOOWAN/fotoboek Is this how it should be implemented? I'm not sure. Embedded params are a bit puzzling :) Hmm ... if that's the exact command-line expression that you entered then if you are using a *nix shell the semicolon would mean the end of command, so in fact what was executed would be: $ bin/nutch parsechecker http://www.funkybabes.nl/ ...lots of output ... bash: ROOOWAN/fotoboek: command not found -- Best regards, Andrzej Bialecki ___. ___ ___ ___ _ _ __ [__ || __|__/|__||\/| Information Retrieval, Semantic Web ___|||__|| \| || | Embedded Unix, System Integration http://www.sigram.com Contact: info at sigram dot com
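The shell pitfall Andrzej points out can be checked directly: an unquoted `;` ends the command, so parsechecker would only ever see the part of the URL before it. Quoting keeps the URL as one argument (a generic illustration, not Nutch-specific):

```shell
# Unquoted, the shell would split at ';' and try to run "ROOOWAN/fotoboek"
# as a second command. Single quotes preserve the whole URL:
url='http://www.funkybabes.nl/;ROOOWAN/fotoboek'
printf '%s\n' "$url"
# so the invocation should be:  bin/nutch parsechecker "$url"
```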
[jira] [Commented] (NUTCH-1097) application/xhtml+xml should be enabled in plugin.xml of parse-html; allow multiple mimetypes for plugin.xml
[ https://issues.apache.org/jira/browse/NUTCH-1097?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13125916#comment-13125916 ] Andrzej Bialecki commented on NUTCH-1097: -- +1, the latest patch looks good. application/xhtml+xml should be enabled in plugin.xml of parse-html; allow multiple mimetypes for plugin.xml Key: NUTCH-1097 URL: https://issues.apache.org/jira/browse/NUTCH-1097 Project: Nutch Issue Type: Bug Components: parser Affects Versions: 1.3 Reporter: Ferdy Priority: Minor Fix For: 1.4 Attachments: NUTCH-1097-nutchgora_v1.patch, NUTCH-1097-nutchgora_v2.patch, NUTCH-1097-v1.patch, NUTCH-1097-v2.patch, NUTCH-1097-v3.patch, NUTCH-1097-v4.patch The configuration in parse-plugins.xml expects the parse-html plugin to accept application/xhtml+xml, however the plugin.xml of this plugin does not list this type. Either change the entry in parse-plugins.xml or change the parse-html plugin.xml. I suggest the latter. See patch. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (NUTCH-1142) Normalization and filtering in WebGraph
[ https://issues.apache.org/jira/browse/NUTCH-1142?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13125931#comment-13125931 ] Andrzej Bialecki commented on NUTCH-1142: -- +1, the patch looks good. (There is one philosophical :) aspect of this change, as with any situation where you calculate PageRank in presence of URL filtering: does it matter that a page was linked to from other pages that you decided to filter out? I.e. in Pagerank the relative page importance is a function of in-degree, and by filtering out incoming links you change the in-degree. This essentially means that you decide to ignore some evidence of a page being possibly more important, due to links from pages that may not be interesting to you but which still do exist. OTOH the incoming links may have been spam, so one would expect that in the grand picture it evens out.) Normalization and filtering in WebGraph --- Key: NUTCH-1142 URL: https://issues.apache.org/jira/browse/NUTCH-1142 Project: Nutch Issue Type: Improvement Reporter: Markus Jelsma Assignee: Markus Jelsma Fix For: 1.5 Attachments: NUTCH-1142-1.4.patch, NUTCH-1142-1.5-2.patch, NUTCH-1142-1.5-3.patch The WebGraph programs performs URL normalization. Since normalization of outlinks is already performed during the parse it should become optional. There is also no URL filtering mechanism in the web graph program. When a CrawlDatum is removed from the CrawlDB by an URL filter is should be possible to remove it from the web graph as well. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
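The in-degree effect Andrzej describes can be shown on a toy link set (the URLs below are hypothetical): a page linked from two sources loses half of its observed in-degree, and hence part of its PageRank input, once one of the linking pages is filtered out.

```java
import java.util.Arrays;
import java.util.Map;
import java.util.stream.Collectors;

// Toy illustration: filtering source URLs out of a web graph changes the
// in-degree of the pages they link to (hypothetical URLs).
public class InDegreeDemo {
  // links as (source, target) pairs
  static final String[][] LINKS = {
    {"http://spam.example/1", "http://a.example/"},
    {"http://good.example/1", "http://a.example/"},
    {"http://good.example/1", "http://b.example/"},
  };

  static Map<String, Long> inDegree(java.util.function.Predicate<String> keepSource) {
    return Arrays.stream(LINKS)
        .filter(l -> keepSource.test(l[0]))   // drop links whose source is filtered
        .collect(Collectors.groupingBy(l -> l[1], Collectors.counting()));
  }

  public static void main(String[] args) {
    // unfiltered graph: a.example has in-degree 2
    System.out.println(inDegree(s -> true).get("http://a.example/"));
    // with a "spam" URL filter: in-degree drops to 1
    System.out.println(inDegree(s -> !s.contains("spam")).get("http://a.example/"));
  }
}
```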
[jira] [Commented] (NUTCH-797) parse-tika is not properly constructing URLs when the target begins with a ?
[ https://issues.apache.org/jira/browse/NUTCH-797?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13124737#comment-13124737 ] Andrzej Bialecki commented on NUTCH-797: - The fixup code in Tika is still a private method in HtmlParser, so in this case the upgrade to Tika 0.10 won't help, we still have to apply the above patch. I'll commit this shortly. parse-tika is not properly constructing URLs when the target begins with a ? -- Key: NUTCH-797 URL: https://issues.apache.org/jira/browse/NUTCH-797 Project: Nutch Issue Type: Bug Components: parser Affects Versions: 1.1 Environment: Win 7, Java(TM) SE Runtime Environment (build 1.6.0_16-b01) Also repro's on RHEL and java 1.4.2 Reporter: Robert Hohman Assignee: Andrzej Bialecki Priority: Minor Fix For: 1.4, nutchgora Attachments: pureQueryUrl-2.patch, pureQueryUrl.patch
[jira] [Commented] (NUTCH-797) parse-tika is not properly constructing URLs when the target begins with a ?
[ https://issues.apache.org/jira/browse/NUTCH-797?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13125016#comment-13125016 ] Andrzej Bialecki commented on NUTCH-797: - Uhh, sorry - I'll fix this in a moment. parse-tika is not properly constructing URLs when the target begins with a ? -- Key: NUTCH-797 URL: https://issues.apache.org/jira/browse/NUTCH-797 Project: Nutch Issue Type: Bug Components: parser Affects Versions: 1.1 Environment: Win 7, Java(TM) SE Runtime Environment (build 1.6.0_16-b01) Also repro's on RHEL and java 1.4.2 Reporter: Robert Hohman Assignee: Andrzej Bialecki Priority: Minor Fix For: nutchgora Attachments: pureQueryUrl-2.patch, pureQueryUrl.patch
[jira] [Commented] (NUTCH-797) parse-tika is not properly constructing URLs when the target begins with a ?
[ https://issues.apache.org/jira/browse/NUTCH-797?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13125077#comment-13125077 ] Andrzej Bialecki commented on NUTCH-797: - I'm puzzled by the algorithm in fixEmbeddedParams (which was refactored into URLUtil), and I don't understand how it was ever supposed to work. If I enable this method then most of the test URLs in TestURLUtil fail, because they are not resolved according to the RFC. In your example in NUTCH-1115, what was the expected result of resolving the base url http://www.funkybabes.nl/;ROOOWAN/fotoboek and e.g. a target of forumregels ? * http://www.funkybabes.nl/forumregels * http://www.funkybabes.nl/;ROOOWAN/forumregels * http://www.funkybabes.nl/forumregels;ROOOWAN * none of the above ;) parse-tika is not properly constructing URLs when the target begins with a ? -- Key: NUTCH-797 URL: https://issues.apache.org/jira/browse/NUTCH-797 Project: Nutch Issue Type: Bug Components: parser Affects Versions: 1.1 Environment: Win 7, Java(TM) SE Runtime Environment (build 1.6.0_16-b01) Also repro's on RHEL and java 1.4.2 Reporter: Robert Hohman Assignee: Andrzej Bialecki Priority: Minor Fix For: nutchgora Attachments: pureQueryUrl-2.patch, pureQueryUrl.patch
[jira] [Updated] (NUTCH-797) parse-tika is not properly constructing URLs when the target begins with a ?
[ https://issues.apache.org/jira/browse/NUTCH-797?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrzej Bialecki updated NUTCH-797: Attachment: NUTCH-797.patch Tentative patch, which changes the meaning of fixEmbeddedParams to removeEmbeddedParams. parse-tika is not properly constructing URLs when the target begins with a ? -- Key: NUTCH-797 URL: https://issues.apache.org/jira/browse/NUTCH-797 Project: Nutch Issue Type: Bug Components: parser Affects Versions: 1.1 Environment: Win 7, Java(TM) SE Runtime Environment (build 1.6.0_16-b01) Also repro's on RHEL and java 1.4.2 Reporter: Robert Hohman Assignee: Andrzej Bialecki Priority: Minor Fix For: nutchgora Attachments: NUTCH-797.patch, pureQueryUrl-2.patch, pureQueryUrl.patch This is my first bug and patch on nutch, so apologies if I have not provided enough detail. In crawling the page at http://careers3.accenture.com/Careers/ASPX/Search.aspx?co=0sk=0 there are links in the page that look like this: a href=?co=0sk=0p=2pi=12/a/tdtda href=?co=0sk=0p=3pi=13/a in org.apache.nutch.parse.tika.DOMContentUtils rev 916362 (trunk), as getOutlinks looks for links, it comes across this link, and constucts a new url with a base URL class built from http://careers3.accenture.com/Careers/ASPX/Search.aspx?co=0sk=0;, and a target of ?co=0sk=0p=2pi=1 The URL class, per RFC 3986 at http://labs.apache.org/webarch/uri/rfc/rfc3986.html#relative-merge, defines how to merge these two, and per the RFC, the URL class merges these to: http://careers3.accenture.com/Careers/ASPX/?co=0sk=0p=2pi=1 because the RFC explicitly states that the rightmost url segment (the Search.aspx in this case) should be ripped off before combining. While this is compliant with the RFC, it means the URLs which are created for the next round of fetching are incorrect. 
Modern browsers seem to handle this case (I checked IE8 and Firefox 3.5), so I'm guessing this is an obscure exception or handling of what is a poorly formed url on accenture's part. I have fixed this by modifying DOMContentUtils to look for the case where a ? begins the target, and then pulling the rightmost component out of the base and inserting it into the target before the ?, so the target in this example becomes: Search.aspx?co=0&sk=0&p=2&pi=1 The URL class then properly constructs the new url as: http://careers3.accenture.com/Careers/ASPX/Search.aspx?co=0&sk=0&p=2&pi=1 If it is agreed that this solution works, I believe the other html parsers in nutch would need to be modified in a similar way. Can I get feedback on this proposed solution? Specifically I'm worried about unforeseen side effects. Much thanks Here is the patch info:

Index: src/plugin/parse-tika/src/java/org/apache/nutch/parse/tika/DOMContentUtils.java
===
--- src/plugin/parse-tika/src/java/org/apache/nutch/parse/tika/DOMContentUtils.java (revision 916362)
+++ src/plugin/parse-tika/src/java/org/apache/nutch/parse/tika/DOMContentUtils.java (working copy)
@@ -299,6 +299,50 @@
     return false;
   }

+  private URL fixURL(URL base, String target) throws MalformedURLException
+  {
+    // handle params that are embedded into the base url - move them to target
+    // so URL class constructs the new url class properly
+    if (base.toString().indexOf(';') > 0)
+      return fixEmbeddedParams(base, target);
+
+    // handle the case that there is a target that is a pure query.
+    // Strictly speaking this is a violation of RFC 2396 section 5.2.2 on how to assemble
+    // URLs but I've seen this in numerous places, for example at
+    // http://careers3.accenture.com/Careers/ASPX/Search.aspx?co=0&sk=0
+    // It has urls in the page of the form href="?co=0&sk=0&pg=1", and by default
+    // URL constructs the base+target combo as
+    // http://careers3.accenture.com/Careers/ASPX/?co=0&sk=0&pg=1, incorrectly
+    // dropping the Search.aspx target
+    //
+    // Browsers handle these just fine, they must have an exception similar to this
+    if (target.startsWith("?"))
+    {
+      return fixPureQueryTargets(base, target);
+    }
+
+    return new URL(base, target);
+  }
+
+  private URL fixPureQueryTargets(URL base, String target) throws MalformedURLException
+  {
+    if (!target.startsWith("?"))
+      return new URL(base, target);
+
+    String basePath = base.getPath();
+    String baseRightMost = "";
+    int baseRightMostIdx = basePath.lastIndexOf('/');
+    if (baseRightMostIdx != -1
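The flattened diff above is hard to follow on its own; here is a self-contained sketch of the same idea, with hypothetical class and method names (the real change lives in DOMContentUtils): re-attach the rightmost path segment of the base before letting java.net.URL resolve a pure-query target.

```java
import java.net.MalformedURLException;
import java.net.URL;

public class PureQueryFix {
    // If the target is a pure query ("?x=1"), java.net.URL (following
    // RFC 2396) drops the rightmost path segment of the base. Re-attach
    // that segment to the target so the resolved URL keeps it.
    static URL fixPureQueryTarget(URL base, String target) throws MalformedURLException {
        if (!target.startsWith("?")) {
            return new URL(base, target);
        }
        String basePath = base.getPath(); // query string is not included
        int idx = basePath.lastIndexOf('/');
        String rightMost = (idx != -1) ? basePath.substring(idx + 1) : "";
        return new URL(base, rightMost + target);
    }

    public static void main(String[] args) throws Exception {
        URL base = new URL("http://example.com/Careers/ASPX/Search.aspx?co=0&sk=0");
        URL fixed = fixPureQueryTarget(base, "?co=0&sk=0&p=2&pi=1");
        // Keeps Search.aspx instead of dropping it.
        System.out.println(fixed);
    }
}
```

A non-query target such as `c.html` still goes straight through `new URL(base, target)`, so normal relative links are unaffected.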
[jira] [Commented] (NUTCH-1097) application/xhtml+xml should be enabled in plugin.xml of parse-html; allow multiple mimetypes for plugin.xml
[ https://issues.apache.org/jira/browse/NUTCH-1097?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13125414#comment-13125414 ] Andrzej Bialecki commented on NUTCH-1097: -- +1 the idea makes sense. Patch looks good, but it needs a minor fix - mime types may also contain '.' characters, e.g. application/vnd.ms-word, and these need to be escaped too. application/xhtml+xml should be enabled in plugin.xml of parse-html; allow multiple mimetypes for plugin.xml Key: NUTCH-1097 URL: https://issues.apache.org/jira/browse/NUTCH-1097 Project: Nutch Issue Type: Bug Components: parser Affects Versions: 1.3 Reporter: Ferdy Priority: Minor Fix For: 1.4 Attachments: NUTCH-1097-nutchgora_v1.patch, NUTCH-1097-v1.patch, NUTCH-1097-v2.patch, NUTCH-1097-v3.patch The configuration in parse-plugins.xml expects the parse-html plugin to accept application/xhtml+xml, however the plugin.xml of this plugin does not list this type. Either change the entry in parse-plugins.xml or change the parse-html plugin.xml. I suggest the latter. See patch. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
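The escaping point is easy to demonstrate: when a mime type string is used as a regular expression, an unescaped '.' matches any character. A small illustration (not the actual Nutch patch code; `Pattern.quote` stands in for whatever escaping the patch uses):

```java
import java.util.regex.Pattern;

public class MimeRegex {
    public static void main(String[] args) {
        String mime = "application/vnd.ms-word";
        // Unescaped, '.' is a regex wildcard: this pattern also matches
        // strings that are NOT this mime type.
        System.out.println("application/vndXms-word".matches(mime)); // true - wrong!
        // Pattern.quote escapes all metacharacters, so only a literal
        // match succeeds.
        System.out.println("application/vndXms-word".matches(Pattern.quote(mime))); // false
        System.out.println("application/vnd.ms-word".matches(Pattern.quote(mime))); // true
    }
}
```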
[jira] [Commented] (NUTCH-1154) Upgrade to Tika 0.10
[ https://issues.apache.org/jira/browse/NUTCH-1154?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13124428#comment-13124428 ] Andrzej Bialecki commented on NUTCH-1154: -- TIKA-748 has been fixed and is scheduled to be included in Tika 1.0. If there are no objections I'd like to commit Tika 0.10, put a comment in CHANGES.txt, and disable this part of the test until we upgrade to Tika 1.0. Upgrade to Tika 0.10 Key: NUTCH-1154 URL: https://issues.apache.org/jira/browse/NUTCH-1154 Project: Nutch Issue Type: Improvement Components: parser Affects Versions: 1.4 Reporter: Andrzej Bialecki Attachments: NUTCH-1154.diff There have been significant improvements in Tika 0.10 and it would be nice to use the latest Tika in 1.4. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Created] (NUTCH-1152) Upgrade to SolrJ 3.4.0
Upgrade to SolrJ 3.4.0 -- Key: NUTCH-1152 URL: https://issues.apache.org/jira/browse/NUTCH-1152 Project: Nutch Issue Type: Improvement Reporter: Andrzej Bialecki Fix For: 1.4 Current release of Lucene/Solr is 3.4.0, but we're still using 3.1.0. The fix is trivial, just replace 3.1.0 with 3.4.0 in ivy.xml. If there are no objections I'll make the change shortly. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Resolved] (NUTCH-1152) Upgrade to SolrJ 3.4.0
[ https://issues.apache.org/jira/browse/NUTCH-1152?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrzej Bialecki resolved NUTCH-1152. -- Resolution: Fixed Assignee: Andrzej Bialecki Committed in rev. 1180087. This commit also upgrades SLF4J as a dependency of SolrJ, to release 1.6.1. Upgrade to SolrJ 3.4.0 -- Key: NUTCH-1152 URL: https://issues.apache.org/jira/browse/NUTCH-1152 Project: Nutch Issue Type: Improvement Reporter: Andrzej Bialecki Assignee: Andrzej Bialecki Fix For: 1.4 Current release of Lucene/Solr is 3.4.0, but we're still using 3.1.0. The fix is trivial, just replace 3.1.0 with 3.4.0 in ivy.xml. If there are no objections I'll make the change shortly. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Created] (NUTCH-1154) Upgrade to Tika 0.10
Upgrade to Tika 0.10 Key: NUTCH-1154 URL: https://issues.apache.org/jira/browse/NUTCH-1154 Project: Nutch Issue Type: Improvement Components: parser Affects Versions: 1.4 Reporter: Andrzej Bialecki There have been significant improvements in Tika 0.10 and it would be nice to use the latest Tika in 1.4. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (NUTCH-1154) Upgrade to Tika 0.10
[ https://issues.apache.org/jira/browse/NUTCH-1154?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrzej Bialecki updated NUTCH-1154: - Attachment: NUTCH-1154.diff Patch to upgrade to Tika 0.10. Unfortunately, TestRTFParser fails with this version of Tika - the extracted body of the text is empty. See TIKA-748. Still, I think the improvements in PDF and Office parsers are worth the upgrade. Upgrade to Tika 0.10 Key: NUTCH-1154 URL: https://issues.apache.org/jira/browse/NUTCH-1154 Project: Nutch Issue Type: Improvement Components: parser Affects Versions: 1.4 Reporter: Andrzej Bialecki Attachments: NUTCH-1154.diff There have been significant improvements in Tika 0.10 and it would be nice to use the latest Tika in 1.4. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (NUTCH-1124) JUnit test for scoring-opic
[ https://issues.apache.org/jira/browse/NUTCH-1124?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13120982#comment-13120982 ] Andrzej Bialecki commented on NUTCH-1124: -- Our implementation is most definitely inaccurate (broken?), though I'm not sure if the original OPIC algorithm is better. The original OPIC paper explains that each node needs to give away all its cash, and then receive cash from other nodes, but in their experiments this led to a yo-yo instability of large amounts of cash floating in and out, in response to changes in the graph and the fact that there is a delay of a full re-crawl cycle, i.e. all known urls need to be re-crawled in order to collect and redistribute all cash that is potentially floating in the graph. In order to dampen this effect they added buffering - a history of the latest N scores, and they would consider an average of these scores. This resulted in smoothing and dampening of changes, but it's an artificial hack that is sensitive to the dynamics of changes in the webgraph and the speed of re-crawl. Our implementation of OPIC doesn't give away cash at all, instead it duplicates it and then distributes, which causes the total amount of cash floating in a webgraph to double in each cycle even when a graph is static. We could fix this by giving away all cash and then introducing a mechanism to collect all cash from dangling nodes (without outlinks) to redistribute it evenly to all nodes. This would bring us closer to the original OPIC without smoothing. Still, I expect the same instability would occur, especially in the face of a changing graph. JUnit test for scoring-opic --- Key: NUTCH-1124 URL: https://issues.apache.org/jira/browse/NUTCH-1124 Project: Nutch Issue Type: Sub-task Components: build Affects Versions: 1.4 Reporter: Lewis John McGibbney Priority: Minor Fix For: 1.5 This issue is part of the larger attempt to provide a Junit test case for every Nutch plugin. 
-- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
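The difference between the two cash-distribution schemes described in the comment can be sketched in a toy simulation. This is illustrative only - it is neither Nutch's scoring code nor the full OPIC algorithm (no history buffering, no dangling nodes); a tiny complete graph is assumed so every node links to every other node.

```java
import java.util.Arrays;

public class OpicSketch {
    // One "cycle" of cash distribution over an n-node complete graph,
    // in two flavors.

    // Original OPIC flavor: a node gives away ALL its cash to its
    // outlinks; the total cash in the graph stays constant.
    static double[] giveAway(double[] cash) {
        int n = cash.length;
        double[] next = new double[n];
        for (int i = 0; i < n; i++) {
            double share = cash[i] / (n - 1); // split among outlinks
            for (int j = 0; j < n; j++) {
                if (j != i) next[j] += share;
            }
        }
        return next;
    }

    // The behavior described for the Nutch implementation: cash is
    // duplicated rather than given away - nodes keep their cash AND
    // distribute a copy, so the total doubles every cycle even on a
    // static graph.
    static double[] duplicate(double[] cash) {
        double[] next = giveAway(cash);
        for (int i = 0; i < next.length; i++) next[i] += cash[i];
        return next;
    }

    static double total(double[] cash) {
        return Arrays.stream(cash).sum();
    }

    public static void main(String[] args) {
        double[] cash = {1.0, 1.0, 1.0};
        System.out.println(total(giveAway(cash)));  // conserved: 3.0
        System.out.println(total(duplicate(cash))); // doubled: 6.0
    }
}
```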
Re: [VOTE] Move 2.0 out of trunk
On 18/09/2011 02:21, Julien Nioche wrote: Hi, Following the discussions [1] on the dev-list about the future of Nutch 2.0, I would like to call for a vote on moving Nutch 2.0 from the trunk to a separate branch, promote 1.4 to trunk and consider 2.0 as unmaintained. The arguments for / against can be found in the thread I mentioned. The vote is open for the next 72 hours. [ ] +1 : Shelve 2.0 and move 1.4 to trunk [ ] 0 : No opinion [ ] -1 : Bad idea. Please give justification. +1 - at this time it's clear that 2.0 didn't pan out as we expected; we should restart from the 1.x codebase as a usable platform, and continue the redesign from there. -- Best regards, Andrzej Bialecki ___. ___ ___ ___ _ _ __ [__ || __|__/|__||\/| Information Retrieval, Semantic Web ___|||__|| \| || | Embedded Unix, System Integration http://www.sigram.com Contact: info at sigram dot com
[jira] [Commented] (NUTCH-1087) Deprecate crawl command and replace with example script
[ https://issues.apache.org/jira/browse/NUTCH-1087?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13089405#comment-13089405 ] Andrzej Bialecki commented on NUTCH-1087: -- IIRC we had this discussion in the past... It's true that we already rely on Bash to do anything useful, no matter whether it's on Windows or on a *nix-like OS. And it's true that the crawl command has been a constant source of confusion over the years. The crawl application also suffered from some subtle bugs, especially when running in local mode (e.g. the PluginRepository leaks). But the argument about maintenance costs is IMHO moot - you have to maintain a shell script, too, so it's no different from maintaining a Java class. Where it differs, I think, is that moving the crawl cycle logic to a shell script now raises the bar for Java developers who are not familiar with Bash scripting - a robust crawl script is not easy to follow, as it needs to handle error conditions and manage input/output resources on HDFS. On the other hand it's easier for system admins to tweak a script rather than tweaking a Java code... so I guess it's also a question of who's the audience for this functionality. I'm +0 for removing Crawl and replacing it with a script, IMHO it doesn't change the picture in any significant way. Deprecate crawl command and replace with example script --- Key: NUTCH-1087 URL: https://issues.apache.org/jira/browse/NUTCH-1087 Project: Nutch Issue Type: Task Affects Versions: 1.4 Reporter: Markus Jelsma Priority: Minor Fix For: 1.4 * remove the crawl command * add basic crawl shell script See thread: http://www.mail-archive.com/dev@nutch.apache.org/msg03848.html -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (NUTCH-1014) Migrate from Apache ORO to java.util.regex
[ https://issues.apache.org/jira/browse/NUTCH-1014?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13067972#comment-13067972 ] Andrzej Bialecki commented on NUTCH-1014: -- java.util.regex has the advantage of being a part of the JRE. However, it is quite slow for more complex regexes. See e.g. this benchmark: http://www.tusker.org/regex/regex_benchmark.html . In my experience with larger crawls this is especially important when using regexes for URL filtering and normalization - an innocent-looking regex can melt the cpu when processing a 64kB long junk URL, and consequently it can stall the crawl... In such cases it's good to have an option to fall back to a subset of regex features and use a DFA-based library like e.g. Brics. ORO is generally faster than j.u.regex (but also it isn't maintained anymore). Brics lacks support for many operators, but it's fast. Perhaps ICU4j would be a good alternative - it's fully JDK-compatible and offers good performance. Migrate from Apache ORO to java.util.regex -- Key: NUTCH-1014 URL: https://issues.apache.org/jira/browse/NUTCH-1014 Project: Nutch Issue Type: Improvement Reporter: Markus Jelsma Fix For: 1.4, 2.0 A separate issue tracking migration of all components from Apache ORO to java.util.regex. Components involved are: - RegexURLNormalzier - OutlinkExtractor - JSParseFilter - MoreIndexingFilter - BasicURLNormalizer -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira
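The backtracking hazard mentioned in the comment can be shown with a deliberately bad pattern (an illustration, not an actual Nutch filter rule). Nested quantifiers make java.util.regex try exponentially many ways to split the input between them on a non-matching string; possessive quantifiers are one way to defuse this within j.u.regex, and DFA engines like Brics avoid it by construction.

```java
import java.util.regex.Pattern;

public class RegexCost {
    public static void main(String[] args) {
        // An innocent-looking pattern with nested quantifiers.
        Pattern evil = Pattern.compile("(a+)+b");
        System.out.println(evil.matcher("aaab").matches());  // true
        // On a non-matching input the engine backtracks through every
        // way of splitting the 'a's between the inner and outer '+'.
        // At ~30 chars this is already billions of steps; a 64kB junk
        // URL would effectively never finish. (Kept short here.)
        System.out.println(evil.matcher("aaaa!").matches()); // false
        // A possessive quantifier never gives characters back, so the
        // failure is linear-time regardless of input length.
        Pattern tame = Pattern.compile("(a++)+b");
        System.out.println(tame.matcher("aaaa!").matches()); // false, fast
    }
}
```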
[jira] [Commented] (NUTCH-985) MoreIndexingFilter doesn't use properly formatted date fields for Solr
[ https://issues.apache.org/jira/browse/NUTCH-985?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13034724#comment-13034724 ] Andrzej Bialecki commented on NUTCH-985: - We should use Solr's DateUtil in all such places, to avoid code duplication and confusion should the date format ever change... The patch does essentially the same as what DateUtil does, only DateUtil reuses SimpleDateFormat instances in a thread-safe way, so it's more efficient. MoreIndexingFilter doesn't use properly formatted date fields for Solr -- Key: NUTCH-985 URL: https://issues.apache.org/jira/browse/NUTCH-985 Project: Nutch Issue Type: Bug Components: indexer Affects Versions: 1.3, 2.0 Reporter: Dietrich Schmidt Assignee: Markus Jelsma Fix For: 1.3, 2.0 Attachments: NUTCH-985-trunk-1.patch, NUTCH-985.1.3-1.patch, indexlastmodifieddate.jar I am using the index-more plugin to parse the lastModified data in web pages in order to store it in a Solr data field. In solrindex-mapping.xml I am mapping lastModified to a field changed in Solr: <field dest="changed" source="lastModified"/> However, when posting data to Solr the SolrIndexer posts it as a long, not as a date: <add><doc boost="1.0"><field name="changed">107932680</field><field name="tstamp">20110414144140188</field><field name="date">20040315</field> Solr rejects the data because of the improper data type. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira
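For reference, the format in question is Solr's ISO-8601 UTC date form (e.g. 1995-12-31T23:59:59Z). On modern JVMs the same output can be produced thread-safely with java.time, without the pooled SimpleDateFormat instances that Solr's DateUtil manages - a sketch of the formatting, not DateUtil itself:

```java
import java.time.Instant;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;

public class SolrDate {
    // DateTimeFormatter is immutable and thread-safe, unlike
    // SimpleDateFormat, so a single shared instance is enough.
    static final DateTimeFormatter SOLR_DATE =
        DateTimeFormatter.ofPattern("yyyy-MM-dd'T'HH:mm:ss'Z'").withZone(ZoneOffset.UTC);

    static String format(long epochMillis) {
        return SOLR_DATE.format(Instant.ofEpochMilli(epochMillis));
    }

    public static void main(String[] args) {
        // Turn a raw lastModified timestamp (epoch millis) into a value
        // Solr's date fields will accept.
        System.out.println(format(0L)); // 1970-01-01T00:00:00Z
    }
}
```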
Re: Differences 1.x and trunk
On 3/18/11 4:31 PM, Markus Jelsma wrote: Hi all, I'm giving it a try to patch https://issues.apache.org/jira/browse/NUTCH-963 to trunk after committing to 1.3. There are of course a lot of differences so I need a little advice on how to proceed: - instead of using CrawlDB and CrawlDatum we now need WebTableReader? Actually you need to use StorageUtils to set up Mapper or Reducer contexts. See other tools, e.g. Fetcher or Generator. - trunk uses slf4j instead of commons logging now? Yes. - a page is now represented by storage.WebPage? Yes. When you prepare a Job you also need to specify what fields from WebPage you are interested in (and only these fields will be pulled in from the storage). This is all handled by StorageUtils methods. -- Best regards, Andrzej Bialecki ___. ___ ___ ___ _ _ __ [__ || __|__/|__||\/| Information Retrieval, Semantic Web ___|||__|| \| || | Embedded Unix, System Integration http://www.sigram.com Contact: info at sigram dot com
Re: [jira] Closed: (NUTCH-951) Backport changes from 2.0 into 1.3
On 3/10/11 10:57 PM, Julien Nioche (JIRA) wrote: [ https://issues.apache.org/jira/browse/NUTCH-951?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Julien Nioche closed NUTCH-951. --- NUTCH-825 committed in revision 1080368 All the known improvements from 2.0 have been backported into 1.3 now The only remaining issue to address before rolling out a 1.3 release is NUTCH-914 Implement Apache Project Branding Requirements (and subtasks...) -- Best regards, Andrzej Bialecki ___. ___ ___ ___ _ _ __ [__ || __|__/|__||\/| Information Retrieval, Semantic Web ___|||__|| \| || | Embedded Unix, System Integration http://www.sigram.com Contact: info at sigram dot com
[jira] Resolved: (NUTCH-951) Backport changes from 2.0 into 1.3
[ https://issues.apache.org/jira/browse/NUTCH-951?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrzej Bialecki resolved NUTCH-951. - Resolution: Fixed Backport changes from 2.0 into 1.3 -- Key: NUTCH-951 URL: https://issues.apache.org/jira/browse/NUTCH-951 Project: Nutch Issue Type: Task Affects Versions: 1.3 Reporter: Julien Nioche Assignee: Andrzej Bialecki Priority: Blocker Fix For: 1.3 I've compared the changes from 2.0 with 1.3 and found the following differences (excluding anything specific to 2.0/GORA) * NUTCH-564 External parser supports encoding attribute (Antony Bowesman, mattmann) * NUTCH-714 Need a SFTP and SCP Protocol Handler (Sanjoy Ghosh, mattmann) * NUTCH-825 Publish nutch artifacts to central maven repository (mattmann) * NUTCH-851 Port logging to slf4j (jnioche) * NUTCH-861 Renamed HTMLParseFilter into ParseFilter * NUTCH-872 Change the default fetcher.parse to FALSE (ab). * NUTCH-876 Remove remaining robots/IP blocking code in lib-http (ab) * NUTCH-880 REST API for Nutch (ab) * NUTCH-883 Remove unused parameters from nutch-default.xml (jnioche) * NUTCH-884 FetcherJob should run more reduce tasks than default (ab) * NUTCH-886 A .gitignore file for Nutch (dogacan) * NUTCH-894 Move statistical language identification from indexing to parsing step * NUTCH-921 Reduce dependency of Nutch on config files (ab) * NUTCH-930 Remove remaining dependencies on Lucene API (ab) * NUTCH-931 Simple admin API to fetch status and stop the service (ab) * NUTCH-932 Bulk REST API to retrieve crawl results as JSON (ab) Let's go through this and decide what to port to 1.3 -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] Commented: (NUTCH-951) Backport changes from 2.0 into 1.3
[ https://issues.apache.org/jira/browse/NUTCH-951?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13004488#comment-13004488 ] Andrzej Bialecki commented on NUTCH-951: - * Ported NUTCH-872 in rev. 1079746. * Ported NUTCH-876 in rev. 1079753. * Ported NUTCH-921 in rev. 1079760. * NUTCH-884 is not applicable to 1.3 because here fetching executes in map tasks, so there's a correct number of them already. Backport changes from 2.0 into 1.3 -- Key: NUTCH-951 URL: https://issues.apache.org/jira/browse/NUTCH-951 Project: Nutch Issue Type: Task Affects Versions: 1.3 Reporter: Julien Nioche Assignee: Andrzej Bialecki Priority: Blocker Fix For: 1.3 I've compared the changes from 2.0 with 1.3 and found the following differences (excluding anything specific to 2.0/GORA) * NUTCH-564 External parser supports encoding attribute (Antony Bowesman, mattmann) * NUTCH-714 Need a SFTP and SCP Protocol Handler (Sanjoy Ghosh, mattmann) * NUTCH-825 Publish nutch artifacts to central maven repository (mattmann) * NUTCH-851 Port logging to slf4j (jnioche) * NUTCH-861 Renamed HTMLParseFilter into ParseFilter * NUTCH-872 Change the default fetcher.parse to FALSE (ab). * NUTCH-876 Remove remaining robots/IP blocking code in lib-http (ab) * NUTCH-880 REST API for Nutch (ab) * NUTCH-883 Remove unused parameters from nutch-default.xml (jnioche) * NUTCH-884 FetcherJob should run more reduce tasks than default (ab) * NUTCH-886 A .gitignore file for Nutch (dogacan) * NUTCH-894 Move statistical language identification from indexing to parsing step * NUTCH-921 Reduce dependency of Nutch on config files (ab) * NUTCH-930 Remove remaining dependencies on Lucene API (ab) * NUTCH-931 Simple admin API to fetch status and stop the service (ab) * NUTCH-932 Bulk REST API to retrieve crawl results as JSON (ab) Let's go through this and decide what to port to 1.3 -- This message is automatically generated by JIRA. 
For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] Resolved: (NUTCH-962) max. redirects not handled correctly: fetcher stops at max-1 redirects
[ https://issues.apache.org/jira/browse/NUTCH-962?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrzej Bialecki resolved NUTCH-962. - Resolution: Fixed Fix Version/s: 2.0 1.3 Assignee: Andrzej Bialecki max. redirects not handled correctly: fetcher stops at max-1 redirects -- Key: NUTCH-962 URL: https://issues.apache.org/jira/browse/NUTCH-962 Project: Nutch Issue Type: Bug Components: fetcher Affects Versions: 1.2, 1.3, 2.0 Reporter: Sebastian Nagel Assignee: Andrzej Bialecki Fix For: 1.3, 2.0 Attachments: Fetcher_redir.patch The fetcher stops following redirects one redirect before the max. redirects is reached. The description of http.redirect.max ("The maximum number of redirects the fetcher will follow when trying to fetch a page. If set to negative or 0, fetcher won't immediately follow redirected URLs, instead it will record them for later fetching.") suggests that if set to 1, one redirect will be followed. I tried to crawl two documents, the first redirecting by <meta http-equiv="refresh" content="0; URL=./to/meta_refresh_target.html"> to the second, with http.redirect.max = 1 The second document is not fetched and the URL has state GONE in CrawlDb. fetching file:/test/redirects/meta_refresh.html redirectCount=0 -finishing thread FetcherThread, activeThreads=1 - content redirect to file:/test/redirects/to/meta_refresh_target.html (fetching now) - redirect count exceeded file:/test/redirects/to/meta_refresh_target.html The attached patch would fix this: if http.redirect.max is 1, one redirect is followed. Of course, this would mean there is no possibility to skip redirects at all since 0 (as well as negative values) means treat redirects as ordinary links. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira
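The off-by-one is easy to see in a stripped-down model of the fetch loop (illustrative counters only - this is not the actual Fetcher code): bailing out when the incremented counter reaches the limit, before the fetch, means max=1 follows zero redirects.

```java
public class RedirectLimit {
    // How many redirects of a `chain`-long redirect chain actually get
    // followed under a given http.redirect.max, with the buggy pre-check.
    static int followedBuggy(int maxRedirects, int chain) {
        int redirectCount = 0;
        while (redirectCount < chain) {
            redirectCount++;
            if (redirectCount >= maxRedirects) {
                // "redirect count exceeded" - bails one step too early
                return redirectCount - 1;
            }
        }
        return redirectCount;
    }

    // Fixed: follow a redirect as long as fewer than max have been taken.
    static int followedFixed(int maxRedirects, int chain) {
        int redirectCount = 0;
        while (redirectCount < chain && redirectCount < maxRedirects) {
            redirectCount++;
        }
        return redirectCount;
    }

    public static void main(String[] args) {
        // With max=1 and a single redirect, the buggy version follows 0:
        System.out.println(followedBuggy(1, 1)); // 0
        System.out.println(followedFixed(1, 1)); // 1
    }
}
```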
[jira] Resolved: (NUTCH-955) Ivy configuration
[ https://issues.apache.org/jira/browse/NUTCH-955?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrzej Bialecki resolved NUTCH-955. - Resolution: Fixed Fix Version/s: 2.0 Assignee: Andrzej Bialecki Ivy configuration - Key: NUTCH-955 URL: https://issues.apache.org/jira/browse/NUTCH-955 Project: Nutch Issue Type: Improvement Components: build Affects Versions: 2.0 Reporter: Alexis Assignee: Andrzej Bialecki Fix For: 2.0 Attachments: ivy.patch As mentioned in NUTCH-950, we can slightly improve the Ivy configuration to help setup the Gora backend more easily. If the user does not want to stick with default HSQL database, other alternatives exist, such as MySQL and HBase. org.restlet and xercesImpl versions should be changed as well. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira
Gora/HBase dependencies and deploy artifacts
Hi all, Recently I've been deploying Nutch trunk to an already existing Hadoop cluster. And immediately I hit a snag. Nutch was configured to use gora-hbase. The nutch.job jar doesn't include gora-hbase even if it was configured in nutch-site.xml. Furthermore, gora-hbase depends on HBase and its dependencies, which need to be found on the classpath. Typically for development and testing I solved this issue by deploying gora-core and gora-hbase + all hbase libs to hadoop/lib across the cluster. This is a bit dirty - Hadoop clusters should be seen as a generic computing fabric, so they should be application-agnostic; besides, this creates maintenance and ops issues. We could put all these libs in lib/ inside nutch.job, so that they are unpacked and put on the classpath during task setup. This would work fine for Mapper/Reducer. HOWEVER... I saw in some versions of Hadoop that InputFormat / OutputFormat classes were initialized prior to this unpacking - and in our case these depend on the libs in the as-yet-unpacked job jar... e.g. GoraInputFormat. (I'm not 100% sure that's the case in Hadoop 0.20.2, so this is something that needs to be tested). Furthermore, even if we packed the jars in lib/ inside nutch.job, still many tools wouldn't work, because they depend on classes from those libs during the local execution (before the job is sent to task trackers), and the URLClassLoader can't load classes from jars within jars... A workaround for this would be to take all those jars and re-pack them together under the / directory in nutch.job. This would satisfy the dependencies for local execution, and for Mapper/Reducer execution, but I'm not sure if it solves the problem of Input/OutputFormat-s that I mentioned above. In short, we need a clear working procedure how to deploy Gora backend implementations so that they work with Nutch and with a generic unmodified Hadoop cluster. -- Best regards, Andrzej Bialecki ___.
___ ___ ___ _ _ __ [__ || __|__/|__||\/| Information Retrieval, Semantic Web ___|||__|| \| || | Embedded Unix, System Integration http://www.sigram.com Contact: info at sigram dot com
[jira] Resolved: (NUTCH-939) Added -dir command line option to Indexer and SolrIndexer, allowing to specify directory containing segments
[ https://issues.apache.org/jira/browse/NUTCH-939?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrzej Bialecki resolved NUTCH-939. - Resolution: Fixed Assignee: Andrzej Bialecki I modified the patch slightly to allow more flexibility (you can mix individual segment names and the -dir options) as well as allowing segments placed on different filesystems. Committed in rev. 1051505. Thank you! Added -dir command line option to Indexer and SolrIndexer, allowing to specify directory containing segments - Key: NUTCH-939 URL: https://issues.apache.org/jira/browse/NUTCH-939 Project: Nutch Issue Type: Improvement Components: indexer Affects Versions: 1.3 Reporter: Claudio Martella Assignee: Andrzej Bialecki Priority: Minor Fix For: 1.3 Attachments: Indexer.patch, SolrIndexer.patch The patches add -dir option, so the user can specify the directory in which the segments are to be found. The actual mode is to specify the list of segments, which is not very easy with hdfs. Also, the -dir option is already implemented in LinkDB and SegmentMerger, for example. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Created: (NUTCH-948) Remove Lucene dependencies
Remove Lucene dependencies -- Key: NUTCH-948 URL: https://issues.apache.org/jira/browse/NUTCH-948 Project: Nutch Issue Type: Improvement Affects Versions: 1.3 Reporter: Andrzej Bialecki Assignee: Andrzej Bialecki Fix For: 1.3 Branch-1.3 still has Lucene libs, but uses Lucene only in one place, namely it uses DateTools in index-basic. DateTools should be replaced with Solr's DateUtil, as we did in trunk, and then we can remove Lucene libs as a dependency. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Resolved: (NUTCH-948) Remove Lucene dependencies
[ https://issues.apache.org/jira/browse/NUTCH-948?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrzej Bialecki resolved NUTCH-948. - Resolution: Fixed Committed in rev. 1051509. Remove Lucene dependencies -- Key: NUTCH-948 URL: https://issues.apache.org/jira/browse/NUTCH-948 Project: Nutch Issue Type: Improvement Affects Versions: 1.3 Reporter: Andrzej Bialecki Assignee: Andrzej Bialecki Fix For: 1.3 Branch-1.3 still has Lucene libs, but uses Lucene only in one place, namely it uses DateTools in index-basic. DateTools should be replaced with Solr's DateUtil, as we did in trunk, and then we can remove Lucene libs as a dependency. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (NUTCH-939) Added -dir command line option to Indexer and SolrIndexer, allowing to specify directory containing segments
[ https://issues.apache.org/jira/browse/NUTCH-939?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12973915#action_12973915 ] Andrzej Bialecki commented on NUTCH-939: - 1.2 release is out, and branch-1.2 is unlikely to result in a subsequent release - most users seem to be interested either in 1.3 or trunk. Added -dir command line option to Indexer and SolrIndexer, allowing to specify directory containing segments - Key: NUTCH-939 URL: https://issues.apache.org/jira/browse/NUTCH-939 Project: Nutch Issue Type: Improvement Components: indexer Affects Versions: 1.3 Reporter: Claudio Martella Assignee: Andrzej Bialecki Priority: Minor Fix For: 1.3 Attachments: Indexer.patch, SolrIndexer.patch The patches add -dir option, so the user can specify the directory in which the segments are to be found. The actual mode is to specify the list of segments, which is not very easy with hdfs. Also, the -dir option is already implemented in LinkDB and SegmentMerger, for example. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
Re: Does Nutch 2.0 in good enough shape to test?
(switching to devs) On 12/17/10 10:18 AM, Alexis wrote: Hi, I've spent some time working on this as well. I've just put together a blog entry addressing the issues I ran into. See http://techvineyard.blogspot.com/2010/12/build-nutch-20.html In a nutshell, I changed three pieces in Gora and Nutch code: - flush the datastore regularly in the Hadoop RecordWriter (in GoraOutputFormat) Careful here. DataStore flush may be very expensive, so it should be done only when we are finished with the output. If you see that data is lost without this flush then this should be reported as a Gora bug. - wait for Hadoop job completion in the Fetcher job I missed your previous email... I'll fix this shortly - thanks for spotting it. -- Best regards, Andrzej Bialecki ___. ___ ___ ___ _ _ __ [__ || __|__/|__||\/| Information Retrieval, Semantic Web ___|||__|| \| || | Embedded Unix, System Integration http://www.sigram.com Contact: info at sigram dot com
Re: Java.io.IOException with multiple <copyField/> directives
On 2010-12-03 09:52, Peter Litsegård wrote: Hi! I've run into a strange behaviour while using Nutch (solrindexer) together with Solr 1.4.1. I'd like to copy the 'title' and 'content' fields to another field, say, 'foo'. In my first attempt I added the <copyField/> directives in schema.xml and got the Java exception, so I removed them from schema.xml. In my second attempt I added the <copyField/> directives to the 'solrindex-mapping.xml' file and ran into the same exception again! Is this a known issue or have I stumbled into unknown territory? Any workarounds? I suspect that the field type declared in your schema.xml is not multiValued. What was the exception? -- Best regards, Andrzej Bialecki
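Andrzej's suspicion points at a common Solr pitfall: a copyField target that receives values from more than one source (here, both title and content) must be declared multiValued, otherwise indexing fails. A minimal illustrative schema.xml fragment (field and type names are assumptions, not taken from the thread):

```xml
<!-- schema.xml sketch: the copyField target 'foo' receives values from
     both 'title' and 'content', so it must be multiValued -->
<field name="title"   type="text" indexed="true" stored="true"/>
<field name="content" type="text" indexed="true" stored="true"/>
<field name="foo"     type="text" indexed="true" stored="true" multiValued="true"/>

<copyField source="title"   dest="foo"/>
<copyField source="content" dest="foo"/>
```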
[jira] Commented: (NUTCH-939) Added -dir command line option to Indexer and SolrIndexer, allowing to specify directory containing segments
[ https://issues.apache.org/jira/browse/NUTCH-939?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12936047#action_12936047 ] Andrzej Bialecki commented on NUTCH-939: - Please note that trunk uses a very different method of working with segments (called batches there), and -dir is not applicable there. Added -dir command line option to Indexer and SolrIndexer, allowing to specify directory containing segments - Key: NUTCH-939 URL: https://issues.apache.org/jira/browse/NUTCH-939 Project: Nutch Issue Type: Improvement Components: indexer Affects Versions: 1.2 Reporter: Claudio Martella Priority: Minor Fix For: 1.2 Attachments: Indexer.patch, SolrIndexer.patch The patches add a -dir option, so the user can specify the directory in which the segments are to be found. The current mode is to specify the list of segments, which is not very easy with hdfs. Also, the -dir option is already implemented in LinkDB and SegmentMerger, for example. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (NUTCH-932) Bulk REST API to retrieve crawl results as JSON
[ https://issues.apache.org/jira/browse/NUTCH-932?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrzej Bialecki updated NUTCH-932: Attachment: NUTCH-932-4.patch Final version of the patch. Bulk REST API to retrieve crawl results as JSON --- Key: NUTCH-932 URL: https://issues.apache.org/jira/browse/NUTCH-932 Project: Nutch Issue Type: New Feature Components: REST_api Affects Versions: 2.0 Reporter: Andrzej Bialecki Assignee: Andrzej Bialecki Attachments: db.formatted.gz, NUTCH-932-2.patch, NUTCH-932-3.patch, NUTCH-932-4.patch, NUTCH-932.patch, NUTCH-932.patch, NUTCH-932.patch It would be useful to be able to retrieve results of a crawl as JSON. There are a few things that need to be discussed: * how to return bulk results using Restlet (WritableRepresentation subclass?) * what should be the format of results? I think it would make sense to provide a single record retrieval (by primary key), all records, and records within a range. This incidentally matches well the capabilities of the Gora Query class :) -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Resolved: (NUTCH-932) Bulk REST API to retrieve crawl results as JSON
[ https://issues.apache.org/jira/browse/NUTCH-932?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrzej Bialecki resolved NUTCH-932. - Resolution: Fixed Fix Version/s: 2.0 Committed in rev. 1039014. Bulk REST API to retrieve crawl results as JSON --- Key: NUTCH-932 URL: https://issues.apache.org/jira/browse/NUTCH-932 Project: Nutch Issue Type: New Feature Components: REST_api Affects Versions: 2.0 Reporter: Andrzej Bialecki Assignee: Andrzej Bialecki Fix For: 2.0 Attachments: db.formatted.gz, NUTCH-932-2.patch, NUTCH-932-3.patch, NUTCH-932-4.patch, NUTCH-932.patch, NUTCH-932.patch, NUTCH-932.patch It would be useful to be able to retrieve results of a crawl as JSON. There are a few things that need to be discussed: * how to return bulk results using Restlet (WritableRepresentation subclass?) * what should be the format of results? I think it would make sense to provide a single record retrieval (by primary key), all records, and records within a range. This incidentally matches well the capabilities of the Gora Query class :) -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (NUTCH-932) Bulk REST API to retrieve crawl results as JSON
[ https://issues.apache.org/jira/browse/NUTCH-932?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrzej Bialecki updated NUTCH-932: Attachment: NUTCH-932-3.patch NutchTool is an abstract class in this patch. This actually minimizes the amount of code throughout, though paradoxically the patch file is larger than before... Bulk REST API to retrieve crawl results as JSON --- Key: NUTCH-932 URL: https://issues.apache.org/jira/browse/NUTCH-932 Project: Nutch Issue Type: New Feature Components: REST_api Affects Versions: 2.0 Reporter: Andrzej Bialecki Assignee: Andrzej Bialecki Attachments: db.formatted.gz, NUTCH-932-2.patch, NUTCH-932-3.patch, NUTCH-932.patch, NUTCH-932.patch, NUTCH-932.patch It would be useful to be able to retrieve results of a crawl as JSON. There are a few things that need to be discussed: * how to return bulk results using Restlet (WritableRepresentation subclass?) * what should be the format of results? I think it would make sense to provide a single record retrieval (by primary key), all records, and records within a range. This incidentally matches well the capabilities of the Gora Query class :) -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (NUTCH-880) REST API for Nutch
[ https://issues.apache.org/jira/browse/NUTCH-880?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12928909#action_12928909 ] Andrzej Bialecki commented on NUTCH-880: - Thanks - this issue is already fixed in NUTCH-932, to be committed soon. REST API for Nutch -- Key: NUTCH-880 URL: https://issues.apache.org/jira/browse/NUTCH-880 Project: Nutch Issue Type: New Feature Affects Versions: 2.0 Reporter: Andrzej Bialecki Assignee: Andrzej Bialecki Fix For: 2.0 Attachments: API-2.patch, API.patch This issue is for discussing a REST-style API for accessing Nutch. Here's an initial idea: * I propose to use org.restlet for handling requests and returning JSON/XML/whatever responses. * hook up all regular tools so that they can be driven via this API. This would have to be an async API, since all Nutch operations take long time to execute. It follows then that we need to be able also to list running operations, retrieve their current status, and possibly abort/cancel/stop/suspend/resume/...? This also means that we would have to potentially create manage many threads in a servlet - AFAIK this is frowned upon by J2EE purists... * package this in a webapp (that includes all deps, essentially nutch.job content), with the restlet servlet as an entry point. Open issues: * how to implement the reading of crawl results via this API * should we manage only crawls that use a single configuration per webapp, or should we have a notion of crawl contexts (sets of crawl configs) with CRUD ops on them? this would be nice, because it would allow managing of several different crawls, with different configs, in a single webapp - but it complicates the implementation a lot. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (NUTCH-932) Bulk REST API to retrieve crawl results as JSON
[ https://issues.apache.org/jira/browse/NUTCH-932?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrzej Bialecki updated NUTCH-932: Attachment: NUTCH-932.patch This patch adds bulk retrieval of crawl results. This is still very rough, e.g. there's no way to select crawlId or limit the fields... but it returns proper JSON. This patch also includes other enhancements and bugfixes - with this patch I was able to perform a complete crawl cycle via REST. Bulk REST API to retrieve crawl results as JSON --- Key: NUTCH-932 URL: https://issues.apache.org/jira/browse/NUTCH-932 Project: Nutch Issue Type: New Feature Components: REST_api Affects Versions: 2.0 Reporter: Andrzej Bialecki Assignee: Andrzej Bialecki Attachments: NUTCH-932.patch It would be useful to be able to retrieve results of a crawl as JSON. There are a few things that need to be discussed: * how to return bulk results using Restlet (WritableRepresentation subclass?) * what should be the format of results? I think it would make sense to provide a single record retrieval (by primary key), all records, and records within a range. This incidentally matches well the capabilities of the Gora Query class :) -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (NUTCH-932) Bulk REST API to retrieve crawl results as JSON
[ https://issues.apache.org/jira/browse/NUTCH-932?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrzej Bialecki updated NUTCH-932: Attachment: db.formatted.gz Example DB content (this was passed through a JSON pretty-printer, otherwise it's just one giant line...). Bulk REST API to retrieve crawl results as JSON --- Key: NUTCH-932 URL: https://issues.apache.org/jira/browse/NUTCH-932 Project: Nutch Issue Type: New Feature Components: REST_api Affects Versions: 2.0 Reporter: Andrzej Bialecki Assignee: Andrzej Bialecki Attachments: db.formatted.gz, NUTCH-932.patch It would be useful to be able to retrieve results of a crawl as JSON. There are a few things that need to be discussed: * how to return bulk results using Restlet (WritableRepresentation subclass?) * what should be the format of results? I think it would make sense to provide a single record retrieval (by primary key), all records, and records within a range. This incidentally matches well the capabilities of the Gora Query class :) -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (NUTCH-932) Bulk REST API to retrieve crawl results as JSON
[ https://issues.apache.org/jira/browse/NUTCH-932?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12928355#action_12928355 ] Andrzej Bialecki commented on NUTCH-932: - Examples (with the db equivalent to the one in db.formatted.gz):
{code}
$ curl -s 'http://localhost:8192/nutch/db?fields=url&end=http://www.freebsd.org/&start=http://www.egothor.org/' | ./json_pp
[
  { "url": "http://www.egothor.org/" },
  { "url": "http://www.freebsd.org/" }
]
{code}
{code}
$ curl -s 'http://localhost:8192/nutch/db?fields=url,outlinks,markers,protocolStatus,parseStatus,contentType&start=http://www.getopt.org/&end=http://www.getopt.org/' | ./json_pp
[
  {
    "contentType": "text/html",
    "url": "http://www.getopt.org/",
    "markers": { "_updmrk_": "1288890451-1134865895" },
    "parseStatus": "success/ok (1/0), args=[]",
    "protocolStatus": "SUCCESS, args=[]",
    "outlinks": {
      "http://www.getopt.org/luke/": "Luke",
      "http://www.getopt.org/ecimf/contrib/ONTO/REA": "REA Ontology page",
      "http://www.getopt.org/CV.pdf": "CV here",
      "http://www.getopt.org/utils/build/api": "API",
      "http://svn.apache.org/viewvc/hadoop/hbase/trunk/src/java/org/apache/hadoop/hbase/util/JenkinsHash.java": "available here",
      "http://www.getopt.org/murmur/MurmurHash.java": "MurmurHash.java",
      "http://www.ebxml.org/": "ebXML / ebTWG",
      "http://www.freebsd.org/": "FreeBSD",
      "http://www.getopt.org/luke/webstart.html": "Launch with Java WebStart",
      "http://www.freebsd.org/%7Epicobsd": "PicoBSD",
      "http://home.comcast.net/~bretm/hash/6.html": "this discussion",
      "http://protege.stanford.edu/": "Protege",
      "http://jakarta.apache.org/lucene": "Lucene",
      "http://www.getopt.org/ecimf/contrib/ONTO/ebxml": "ebXML Ontology",
      "http://www.getopt.org/ecimf/": "here",
      "http://www.isthe.com/chongo/tech/comp/fnv/": "his website",
      "http://www.getopt.org/stempel/index.html": "Stempel",
      "http://www.sigram.com/": "SIGRAM",
      "http://www.egothor.org/": "Egothor",
      "http://thinlet.sourceforge.net/": "Thinlet",
      "http://www.getopt.org/utils/dist/utils-1.0.jar": "binary",
      "http://www.ecimf.org/": "ECIMF"
    }
  }
]
{code}
Bulk REST API to retrieve crawl results
as JSON --- Key: NUTCH-932 URL: https://issues.apache.org/jira/browse/NUTCH-932 Project: Nutch Issue Type: New Feature Components: REST_api Affects Versions: 2.0 Reporter: Andrzej Bialecki Assignee: Andrzej Bialecki Attachments: db.formatted.gz, NUTCH-932.patch, NUTCH-932.patch It would be useful to be able to retrieve results of a crawl as JSON. There are a few things that need to be discussed: * how to return bulk results using Restlet (WritableRepresentation subclass?) * what should be the format of results? I think it would make sense to provide a single record retrieval (by primary key), all records, and records within a range. This incidentally matches well the capabilities of the Gora Query class :) -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (NUTCH-932) Bulk REST API to retrieve crawl results as JSON
[ https://issues.apache.org/jira/browse/NUTCH-932?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrzej Bialecki updated NUTCH-932: Attachment: NUTCH-932.patch Updated patch - this recognizes now URL parameters such as fields, start/end keys, batch and crawl id. Bulk REST API to retrieve crawl results as JSON --- Key: NUTCH-932 URL: https://issues.apache.org/jira/browse/NUTCH-932 Project: Nutch Issue Type: New Feature Components: REST_api Affects Versions: 2.0 Reporter: Andrzej Bialecki Assignee: Andrzej Bialecki Attachments: db.formatted.gz, NUTCH-932.patch, NUTCH-932.patch It would be useful to be able to retrieve results of a crawl as JSON. There are a few things that need to be discussed: * how to return bulk results using Restlet (WritableRepresentation subclass?) * what should be the format of results? I think it would make sense to provide a single record retrieval (by primary key), all records, and records within a range. This incidentally matches well the capabilities of the Gora Query class :) -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Resolved: (NUTCH-931) Simple admin API to fetch status and stop the service
[ https://issues.apache.org/jira/browse/NUTCH-931?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrzej Bialecki resolved NUTCH-931. - Resolution: Fixed Committed in rev. 1028736 with some changes. Simple admin API to fetch status and stop the service - Key: NUTCH-931 URL: https://issues.apache.org/jira/browse/NUTCH-931 Project: Nutch Issue Type: Improvement Components: REST_api Affects Versions: 2.0 Reporter: Andrzej Bialecki Assignee: Andrzej Bialecki Fix For: 2.0 Attachments: NUTCH-931.patch REST API needs a simple info / stats service and the ability to shutdown the server. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (NUTCH-880) REST API for Nutch
[ https://issues.apache.org/jira/browse/NUTCH-880?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrzej Bialecki updated NUTCH-880: Summary: REST API for Nutch (was: REST API (and webapp) for Nutch) The webapp part is tracked now in NUTCH-929. REST API for Nutch -- Key: NUTCH-880 URL: https://issues.apache.org/jira/browse/NUTCH-880 Project: Nutch Issue Type: New Feature Affects Versions: 2.0 Reporter: Andrzej Bialecki Assignee: Andrzej Bialecki Attachments: API-2.patch, API.patch This issue is for discussing a REST-style API for accessing Nutch. Here's an initial idea: * I propose to use org.restlet for handling requests and returning JSON/XML/whatever responses. * hook up all regular tools so that they can be driven via this API. This would have to be an async API, since all Nutch operations take long time to execute. It follows then that we need to be able also to list running operations, retrieve their current status, and possibly abort/cancel/stop/suspend/resume/...? This also means that we would have to potentially create manage many threads in a servlet - AFAIK this is frowned upon by J2EE purists... * package this in a webapp (that includes all deps, essentially nutch.job content), with the restlet servlet as an entry point. Open issues: * how to implement the reading of crawl results via this API * should we manage only crawls that use a single configuration per webapp, or should we have a notion of crawl contexts (sets of crawl configs) with CRUD ops on them? this would be nice, because it would allow managing of several different crawls, with different configs, in a single webapp - but it complicates the implementation a lot. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Resolved: (NUTCH-880) REST API for Nutch
[ https://issues.apache.org/jira/browse/NUTCH-880?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrzej Bialecki resolved NUTCH-880. - Resolution: Fixed Fix Version/s: 2.0 Committed in rev. 1028235. The webapp part of this issue is tracked now in NUTCH-929. REST API for Nutch -- Key: NUTCH-880 URL: https://issues.apache.org/jira/browse/NUTCH-880 Project: Nutch Issue Type: New Feature Affects Versions: 2.0 Reporter: Andrzej Bialecki Assignee: Andrzej Bialecki Fix For: 2.0 Attachments: API-2.patch, API.patch This issue is for discussing a REST-style API for accessing Nutch. Here's an initial idea: * I propose to use org.restlet for handling requests and returning JSON/XML/whatever responses. * hook up all regular tools so that they can be driven via this API. This would have to be an async API, since all Nutch operations take long time to execute. It follows then that we need to be able also to list running operations, retrieve their current status, and possibly abort/cancel/stop/suspend/resume/...? This also means that we would have to potentially create manage many threads in a servlet - AFAIK this is frowned upon by J2EE purists... * package this in a webapp (that includes all deps, essentially nutch.job content), with the restlet servlet as an entry point. Open issues: * how to implement the reading of crawl results via this API * should we manage only crawls that use a single configuration per webapp, or should we have a notion of crawl contexts (sets of crawl configs) with CRUD ops on them? this would be nice, because it would allow managing of several different crawls, with different configs, in a single webapp - but it complicates the implementation a lot. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Created: (NUTCH-930) Remove remaining dependencies on Lucene API
Remove remaining dependencies on Lucene API --- Key: NUTCH-930 URL: https://issues.apache.org/jira/browse/NUTCH-930 Project: Nutch Issue Type: Improvement Affects Versions: 2.0 Reporter: Andrzej Bialecki Assignee: Andrzej Bialecki Nutch doesn't use Lucene API anymore, all indexing happens via Lucene-agnostic SolrJ API. The only place where we still use a minor part of Lucene is in index-basic, and that use (DateTools) can be easily replaced. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (NUTCH-930) Remove remaining dependencies on Lucene API
[ https://issues.apache.org/jira/browse/NUTCH-930?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrzej Bialecki updated NUTCH-930: Attachment: NUTCH-930.patch Patch to fix the issue. I'll commit this shortly. Remove remaining dependencies on Lucene API --- Key: NUTCH-930 URL: https://issues.apache.org/jira/browse/NUTCH-930 Project: Nutch Issue Type: Improvement Affects Versions: 2.0 Reporter: Andrzej Bialecki Assignee: Andrzej Bialecki Attachments: NUTCH-930.patch Nutch doesn't use Lucene API anymore, all indexing happens via Lucene-agnostic SolrJ API. The only place where we still use a minor part of Lucene is in index-basic, and that use (DateTools) can be easily replaced. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Resolved: (NUTCH-930) Remove remaining dependencies on Lucene API
[ https://issues.apache.org/jira/browse/NUTCH-930?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrzej Bialecki resolved NUTCH-930. - Resolution: Fixed Fix Version/s: 2.0 Committed in rev. 1028474. Remove remaining dependencies on Lucene API --- Key: NUTCH-930 URL: https://issues.apache.org/jira/browse/NUTCH-930 Project: Nutch Issue Type: Improvement Affects Versions: 2.0 Reporter: Andrzej Bialecki Assignee: Andrzej Bialecki Fix For: 2.0 Attachments: NUTCH-930.patch Nutch doesn't use Lucene API anymore, all indexing happens via Lucene-agnostic SolrJ API. The only place where we still use a minor part of Lucene is in index-basic, and that use (DateTools) can be easily replaced. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Created: (NUTCH-931) Simple admin API to fetch status and stop the service
Simple admin API to fetch status and stop the service - Key: NUTCH-931 URL: https://issues.apache.org/jira/browse/NUTCH-931 Project: Nutch Issue Type: Improvement Components: REST_api Affects Versions: 2.0 Reporter: Andrzej Bialecki Assignee: Andrzej Bialecki Fix For: 2.0 REST API needs a simple info / stats service and the ability to shutdown the server. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (NUTCH-926) Nutch follows wrong url in META http-equiv=refresh tag
[ https://issues.apache.org/jira/browse/NUTCH-926?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12925543#action_12925543 ] Andrzej Bialecki commented on NUTCH-926: - bq. Nutch continues to crawl the WRONG subdomains! But it should not do this!! No need to shout, we hear you :) Indeed, Nutch behavior when following redirects doesn't play well with the rule of ignoring external outlinks. Strictly speaking, redirects are not outlinks, but the silent assumption behind ignoreExternalOutlinks is that we crawl content only from that hostname. And your patch would solve this particular issue. However, this is not as simple as it seems... My favorite example is www.ibm.com -> www8.ibm.com/index.html . If we apply your fix you won't be able to crawl www.ibm.com unless you inject all wwwNNN load-balanced hosts... so a simple equality of hostnames may not be sufficient. We have utilities to extract domain names, so we could compare domains but then we may mistreat money.cnn.com vs. weather.cnn.com ... Nutch follows wrong url in META http-equiv=refresh tag - Key: NUTCH-926 URL: https://issues.apache.org/jira/browse/NUTCH-926 Project: Nutch Issue Type: Bug Components: parser Affects Versions: 1.2 Environment: gnu/linux centOs Reporter: Marco Novo Priority: Critical Fix For: 1.3 Attachments: ParseOutputFormat.java.patch We have nutch set to crawl a domain urllist and we want to fetch only passed domains (hosts), not subdomains. So WWW.DOMAIN1.COM .. .. .. WWW.RIGHTDOMAIN.COM .. .. .. .. WWW.DOMAIN.COM We set nutch to: NOT FOLLOW EXTERNAL LINKS During crawling of WWW.RIGHTDOMAIN.COM if a page contains <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"> <html> <head> <title></title> <META http-equiv="refresh" content="0; url=http://WRONG.RIGHTDOMAIN.COM"> </head> <body> </body> </html> Nutch continues to crawl the WRONG subdomains! But it should not do this!!
During crawling of WWW.RIGHTDOMAIN.COM if a page contains <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"> <html> <head> <title></title> <META http-equiv="refresh" content="0; url=http://WWW.WRONGDOMAIN.COM"> </head> <body> </body> </html> Nutch continues to crawl the WRONG domain! But it should not do this! If it does that, we will spider the whole web. We think the problem is in org.apache.nutch.parse ParseOutputFormat. We have done a patch, so we will attach it -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
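The scope question in Andrzej's comment (strict host equality vs. domain equality) can be made concrete with a small sketch. The `lastTwoLabels` helper is a deliberate oversimplification for illustration; production code would need a public-suffix-aware utility to extract registered domains correctly.

```java
import java.net.URI;

public class RedirectScope {
    // Naive registered-domain extraction: keep the last two host labels.
    // An assumption for illustration only; real code needs a public-suffix list.
    static String lastTwoLabels(String host) {
        String[] p = host.split("\\.");
        int n = p.length;
        return n <= 2 ? host : p[n - 2] + "." + p[n - 1];
    }

    static boolean sameHost(String a, String b) throws Exception {
        return new URI(a).getHost().equalsIgnoreCase(new URI(b).getHost());
    }

    static boolean sameDomain(String a, String b) throws Exception {
        return lastTwoLabels(new URI(a).getHost())
                .equalsIgnoreCase(lastTwoLabels(new URI(b).getHost()));
    }

    public static void main(String[] args) throws Exception {
        // Strict host equality rejects the www.ibm.com load-balancer redirect...
        System.out.println(sameHost("http://www.ibm.com/", "http://www8.ibm.com/index.html"));
        // ...while domain equality accepts it, but also lumps together
        // subdomains that may be entirely distinct sites:
        System.out.println(sameDomain("http://www.ibm.com/", "http://www8.ibm.com/index.html"));
        System.out.println(sameDomain("http://money.cnn.com/", "http://weather.cnn.com/"));
    }
}
```

Neither rule is right for every site, which is exactly the trade-off the comment describes: host equality breaks wwwNNN load balancing, domain equality conflates money.cnn.com with weather.cnn.com.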
Re: ReviewBoard Instance
On 2010-10-26 15:53, Mattmann, Chris A (388J) wrote: Hi Guys, Gav from infra@ set up a ReviewBoard instance for Apache [1]. I've never used it before but I thought I'd request an account on it for Nutch [2] regardless, so if folks want to use it, they can. Hmm, I may be missing something... but what's the point of using the tool in our JIRA-based workflow? It looks to me like it duplicates at least part of JIRA's functionality, and the remaining part is what we do also in JIRA by convention... -- Best regards, Andrzej Bialecki
[jira] Commented: (NUTCH-913) Nutch should use new namespace for Gora
[ https://issues.apache.org/jira/browse/NUTCH-913?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12924659#action_12924659 ] Andrzej Bialecki commented on NUTCH-913: - +1, let's commit it - I want to start playing with GORA-9, and that patch is in the org.apache namespace... Nutch should use new namespace for Gora --- Key: NUTCH-913 URL: https://issues.apache.org/jira/browse/NUTCH-913 Project: Nutch Issue Type: Bug Components: storage Reporter: Doğacan Güney Assignee: Doğacan Güney Fix For: 2.0 Attachments: NUTCH-913_v1.patch, NUTCH-913_v2.patch Gora is in Apache Incubator now (Yey!). We recently changed Gora's namespace from org.gora to org.apache.gora. This means nutch should use the new namespace otherwise it won't compile with newer builds of Gora. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (NUTCH-923) Multilingual support for Solr-index-mapping
[ https://issues.apache.org/jira/browse/NUTCH-923?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12924154#action_12924154 ] Andrzej Bialecki commented on NUTCH-923: - This doesn't solve the problem of a potentially unbounded number of fields. Compliance is one thing, and you can clean up field names from invalid characters, but sanity is another thing - if you have {{title_*}} in your Solr schema then theoretically you are allowed to create an unlimited number of fields with this prefix - Solr won't complain. Multilingual support for Solr-index-mapping --- Key: NUTCH-923 URL: https://issues.apache.org/jira/browse/NUTCH-923 Project: Nutch Issue Type: Improvement Components: indexer Affects Versions: 1.2 Reporter: Matthias Agethle Assignee: Markus Jelsma Priority: Minor It would be useful to extend the mapping possibilities when indexing to Solr. One useful feature would be to use the detected language of the HTML page (for example via the language-identifier plugin) and send the content to corresponding language-aware Solr fields. The mapping file could be as follows: <field dest="lang" source="lang"/> <field dest="title_${lang}" source="title"/> so that the title field gets mapped to title_en for English pages and title_fr for French pages. What do you think? Could this be useful also to others? Or are there already other solutions out there? -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (NUTCH-924) Static field in solr mapping
[ https://issues.apache.org/jira/browse/NUTCH-924?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12923845#action_12923845 ] Andrzej Bialecki commented on NUTCH-924: - The functionality is useful, +1. But the patch has formatting errors. Please fix them before committing. The same functionality should be added to trunk, too. Static field in solr mapping Key: NUTCH-924 URL: https://issues.apache.org/jira/browse/NUTCH-924 Project: Nutch Issue Type: Improvement Components: indexer Affects Versions: 1.3 Reporter: David Stuart Assignee: Markus Jelsma Fix For: 1.3 Attachments: nutch_1.3_static_field.patch Original Estimate: 0h Remaining Estimate: 0h Provide the facility to pass static data defined in solrindex-mapping.xml to solr during the mapping process. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (NUTCH-923) Multilingual support for Solr-index-mapping
[ https://issues.apache.org/jira/browse/NUTCH-923?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12923896#action_12923896 ] Andrzej Bialecki commented on NUTCH-923: - This sounds useful, though the implementation needs to keep the following in mind: * you _assume_ that the lang field will have a nice predictable value, but unless you sanitize the values you can't assume anything... example: one page I saw had its language metadata set to a random string 8kB long with various control chars and '\0'-s. * again, if you don't sanitize and control the total number of unique values in the source field, you could end up with a number of fields approaching infinity, and Solr would melt down... Multilingual support for Solr-index-mapping --- Key: NUTCH-923 URL: https://issues.apache.org/jira/browse/NUTCH-923 Project: Nutch Issue Type: Improvement Components: indexer Affects Versions: 1.2 Reporter: Matthias Agethle Assignee: Markus Jelsma Priority: Minor It would be useful to extend the mapping possibilities when indexing to Solr. One useful feature would be to use the detected language of the HTML page (for example via the language-identifier plugin) and send the content to corresponding language-aware Solr fields. The mapping file could be as follows: <field dest="lang" source="lang"/> <field dest="title_${lang}" source="title"/> so that the title field gets mapped to title_en for English pages and title_fr for French pages. What do you think? Could this be useful also to others? Or are there already other solutions out there? -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
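A sketch of the sanitization step Andrzej asks for, bounding the set of possible field suffixes so the number of title_* fields stays finite. The method name and the whitelist are illustrative assumptions, not Nutch code:

```java
import java.util.Set;

public class LangFieldSuffix {
    // Closed set of accepted suffixes keeps the number of title_* fields bounded.
    // Illustrative whitelist; a real deployment would list the languages it indexes.
    private static final Set<String> KNOWN = Set.of("en", "fr", "de", "it", "es");

    // Hypothetical sanitizer: maps a raw lang metadata value to a safe suffix.
    static String suffix(String rawLang) {
        if (rawLang == null) return "other";
        String s = rawLang.trim().toLowerCase();
        // reject anything that is not a plain 2-3 letter code (an 8 kB random
        // string with control chars, as in the example above, fails this test)
        if (!s.matches("[a-z]{2,3}")) return "other";
        return KNOWN.contains(s) ? s : "other";
    }

    public static void main(String[] args) {
        System.out.println(suffix("EN"));          // normalized to a known code
        System.out.println(suffix("\0garbage"));   // hostile metadata -> bucket
        System.out.println(suffix("xx"));          // valid shape, unknown code
    }
}
```

With this in place the mapping can only ever produce title_en, title_fr, ..., title_other, which addresses both of the concerns in the comment.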
Re: Build failed in Hudson: Nutch-trunk #1280
On 2010-10-19 06:01, Apache Hudson Server wrote: [Nutch-trunk] $ /bin/bash -xe /tmp/hudson7277994413075810777.sh + PATH=/home/hudson/tools/java/latest1.6/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/ucb:/usr/local/bin:/usr/bin:/usr/sfw/bin:/usr/sfw/sbin:/opt/sfw/bin:/opt/sfw/sbin:/opt/SUNWspro/bin:/usr/X/bin:/usr/ucb:/usr/sbin:/usr/ccs/bin + export ANT_HOME=/export/home/hudson/tools/ant/latest + ANT_HOME=/export/home/hudson/tools/ant/latest + export PATH ANT_HOME + cd trunk + /export/home/hudson/tools/ant/latest/bin/ant -Dversion=2010-10-19_04-00-41 -Dtest.junit.output.format=xml nightly /tmp/hudson7277994413075810777.sh: line 7: /export/home/hudson/tools/ant/latest/bin/ant: No such file or directory Do you know guys why the automated builds are failing? Looks like Ant is not where the build script expects it to be... -- Best regards, Andrzej Bialecki
[jira] Updated: (NUTCH-921) Reduce dependency of Nutch on config files
[ https://issues.apache.org/jira/browse/NUTCH-921?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrzej Bialecki updated NUTCH-921: Attachment: NUTCH-921.patch Patch that implements reading config parameters from Configuration, and falls back to config files if Configuration properties are unspecified. Reduce dependency of Nutch on config files -- Key: NUTCH-921 URL: https://issues.apache.org/jira/browse/NUTCH-921 Project: Nutch Issue Type: Improvement Affects Versions: 2.0 Reporter: Andrzej Bialecki Assignee: Andrzej Bialecki Fix For: 2.0 Attachments: NUTCH-921.patch Currently many components in Nutch rely on reading their configuration from files. These files need to be on the classpath (or packed into a job jar). This is inconvenient if you want to manage configuration via API, e.g. when embedding Nutch, or running many jobs with slightly different configurations. This issue tracks the improvement to make various components read their config directly from Configuration properties. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
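The fallback described in the patch note — read from Configuration properties first, consult a config file only when the property is unspecified — can be sketched like this. The class below is a toy stand-in for Hadoop's Configuration, and the property and file names are illustrative, not taken from the actual patch:

```java
import java.io.IOException;
import java.io.InputStream;
import java.util.HashMap;
import java.util.Map;
import java.util.Properties;

public class ConfigFallback {
    // Stand-in for the in-memory Configuration properties set via API.
    private final Map<String, String> conf = new HashMap<>();

    public void set(String key, String value) { conf.put(key, value); }

    // Property wins; a file on the classpath is only consulted as a fallback.
    public String get(String key, String resourceFile) {
        String v = conf.get(key);
        if (v != null) return v;
        try (InputStream in = ConfigFallback.class.getResourceAsStream(resourceFile)) {
            if (in == null) return null; // no file either: genuinely unset
            Properties p = new Properties();
            p.load(in);
            return p.getProperty(key);
        } catch (IOException e) {
            return null;
        }
    }
}
```

This ordering is what makes embedding convenient: a caller who sets everything programmatically never needs the file on the classpath at all.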
[jira] Commented: (NUTCH-913) Nutch should use new namespace for Gora
[ https://issues.apache.org/jira/browse/NUTCH-913?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12920610#action_12920610 ] Andrzej Bialecki commented on NUTCH-913: - There are formatting issues in DomainStatistics.java - the file uses literal tabs, which we frown upon, but the patch introduces double-space indent in the changed lines. As ugly as it sounds I think this should be changed into tabs, and then reformatted in another commit. Other than that, +1, go for it. Nutch should use new namespace for Gora --- Key: NUTCH-913 URL: https://issues.apache.org/jira/browse/NUTCH-913 Project: Nutch Issue Type: Bug Components: storage Reporter: Doğacan Güney Assignee: Doğacan Güney Fix For: 2.0 Attachments: NUTCH-913_v1.patch Gora is in Apache Incubator now (Yey!). We recently changed Gora's namespace from org.gora to org.apache.gora. This means nutch should use the new namespace otherwise it won't compile with newer builds of Gora. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (NUTCH-907) DataStore API doesn't support multiple storage areas for multiple disjoint crawls
[ https://issues.apache.org/jira/browse/NUTCH-907?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12916870#action_12916870 ] Andrzej Bialecki commented on NUTCH-907: - Hi Sertan, Thanks for the patch, this looks very good! A few comments: * I'm not good at naming things either... schemaId is a little bit cryptic though. If we didn't already use crawlId I would vote for that (and then rename crawlId to batchId or fetchId), as it is now... I don't know, maybe datasetId... * since we now create multiple datasets, we somehow need to manage them - i.e. list and delete at least (create is implicit). There is no such functionality in this patch, but this can be addressed also as a separate issue. * IndexerMapReduce.createIndexJob: I think it would be useful to pass the datasetId as a Job property - this way indexing filter plugins can use this property to populate NutchDocument fields if needed. FWIW, this may be a good idea to do in other jobs as well... DataStore API doesn't support multiple storage areas for multiple disjoint crawls - Key: NUTCH-907 URL: https://issues.apache.org/jira/browse/NUTCH-907 Project: Nutch Issue Type: Bug Reporter: Andrzej Bialecki Fix For: 2.0 Attachments: NUTCH-907.patch In Nutch 1.x it was possible to easily select a set of crawl data (crawldb, page data, linkdb, etc) by specifying a path where the data was stored. This enabled users to run several disjoint crawls with different configs, but still using the same storage medium, just under different paths. This is not possible now because there is a 1:1 mapping between a specific DataStore instance and a set of crawl data. In order to support this functionality the Gora API should be extended so that it can create stores (and data tables in the underlying storage) that use arbitrary prefixes to identify the particular crawl dataset. 
Then the Nutch API should be extended to allow passing this crawlId value to select one of possibly many existing crawl datasets. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
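The third point in the comment — stamping the dataset id into the job configuration so indexing filter plugins can read it back — might look like this sketch, using java.util.Properties as a stand-in for the Hadoop job configuration (the key name is hypothetical, not from the patch):

```java
import java.util.Properties;

public class DatasetIdProperty {
    // Hypothetical property name; the real key would be chosen in the patch.
    static final String KEY = "storage.dataset.id";

    // Job-creation side (cf. IndexerMapReduce.createIndexJob):
    // record which dataset this job operates on.
    public static Properties createIndexJobConf(String datasetId) {
        Properties jobConf = new Properties();
        jobConf.setProperty(KEY, datasetId);
        return jobConf;
    }

    // Plugin side: read it back, with a default for jobs that never set it,
    // and use it e.g. to populate a NutchDocument field.
    public static String datasetIdFor(Properties jobConf) {
        return jobConf.getProperty(KEY, "default");
    }
}
```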
[jira] Commented: (NUTCH-882) Design a Host table in GORA
[ https://issues.apache.org/jira/browse/NUTCH-882?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12916874#action_12916874 ] Andrzej Bialecki commented on NUTCH-882: - Doğacan, I missed your previous comment... the issue with partial bloom filters is usually solved by having each task store its own filter - this worked well for MapFile-s because they consisted of multiple parts, so then a Reader would open a part and a corresponding bloom filter. Here it's more complicated, I agree... though this reminds me of the situation that is handled by DynamicBloomFilter: it's basically a set of Bloom filters with a facade that hides this fact from the user. Here we could construct something similar, i.e. don't merge partial filters after closing the output, but instead when opening a Reader read all partial filters and pretend they are one. Design a Host table in GORA --- Key: NUTCH-882 URL: https://issues.apache.org/jira/browse/NUTCH-882 Project: Nutch Issue Type: New Feature Affects Versions: 2.0 Reporter: Julien Nioche Assignee: Julien Nioche Fix For: 2.0 Attachments: hostdb.patch, NUTCH-882-v1.patch Having a separate GORA table for storing information about hosts (and domains?) would be very useful for : * customising the behaviour of the fetching on a host basis e.g. number of threads, min time between threads etc... * storing stats * keeping metadata and possibly propagate them to the webpages * keeping a copy of the robots.txt and possibly use that later to filter the webtable * store sitemaps files and update the webtable accordingly I'll try to come up with a GORA schema for such a host table but any comments are of course already welcome -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
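The "pretend they are one" idea can be sketched as a reader-side facade: keep the partial filters separate on disk, open them all when the Reader opens, and report membership if any part matches, so no merge is ever needed at write time. The SimpleBloom below is a toy stand-in, not Hadoop's DynamicBloomFilter:

```java
import java.util.BitSet;
import java.util.List;

public class PartialBloomFacade {
    // Toy Bloom filter with two hash probes; illustration only.
    public static class SimpleBloom {
        private final BitSet bits = new BitSet(1 << 16);
        public void add(String key) {
            bits.set(Math.floorMod(key.hashCode(), 1 << 16));
            bits.set(Math.floorMod(key.hashCode() * 31 + 7, 1 << 16));
        }
        public boolean mightContain(String key) {
            return bits.get(Math.floorMod(key.hashCode(), 1 << 16))
                && bits.get(Math.floorMod(key.hashCode() * 31 + 7, 1 << 16));
        }
    }

    // Each task wrote one partial filter; the Reader opens all of them.
    private final List<SimpleBloom> parts;
    public PartialBloomFacade(List<SimpleBloom> parts) { this.parts = parts; }

    // A key might be present if any partial filter says so - the union
    // of the parts behaves exactly like one merged filter.
    public boolean mightContain(String key) {
        return parts.stream().anyMatch(p -> p.mightContain(key));
    }
}
```

The false-positive rate of the union is bounded by the sum of the parts' rates, which is why this works without a physical merge.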
[jira] Commented: (NUTCH-864) Fetcher generates entries with status 0
[ https://issues.apache.org/jira/browse/NUTCH-864?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12916912#action_12916912 ] Andrzej Bialecki commented on NUTCH-864: - I think the difficulty comes from the simplification in 2.x as compared to 1.x, in that we keep a single status per page. In 1.x a side-effect of having two locations with two statuses (one db status in crawldb and one fetch status in segments) was that we had more information in updatedb to act upon. Now we should probably keep up to two statuses - one that reflects a temporary fetch status, as determined by fetcher, and a final (reconciled) status as determined by updatedb, based on the knowledge of not only the plain fetch status and old status but also possible redirects. If I'm not mistaken, currently the status is immediately overwritten by fetcher, even before we come to updatedb, hence the problem... Fetcher generates entries with status 0 --- Key: NUTCH-864 URL: https://issues.apache.org/jira/browse/NUTCH-864 Project: Nutch Issue Type: Bug Components: fetcher Environment: Gora with SQLBackend URL: https://svn.apache.org/repos/asf/nutch/branches/nutchbase Last Changed Rev: 980748 Last Changed Date: 2010-07-30 14:19:52 +0200 (Fri, 30 Jul 2010) Reporter: Julien Nioche Assignee: Doğacan Güney Fix For: 2.0 After a round of fetching which got the following protocol status : 10/07/30 15:11:39 INFO mapred.JobClient: ACCESS_DENIED=2 10/07/30 15:11:39 INFO mapred.JobClient: SUCCESS=1177 10/07/30 15:11:39 INFO mapred.JobClient: GONE=3 10/07/30 15:11:39 INFO mapred.JobClient: TEMP_MOVED=138 10/07/30 15:11:39 INFO mapred.JobClient: EXCEPTION=93 10/07/30 15:11:39 INFO mapred.JobClient: MOVED=521 10/07/30 15:11:39 INFO mapred.JobClient: NOTFOUND=62 I ran : ./nutch org.apache.nutch.crawl.WebTableReader -stats 10/07/30 15:12:37 INFO crawl.WebTableReader: Statistics for WebTable: 10/07/30 15:12:37 INFO crawl.WebTableReader: TOTAL urls: 2690 10/07/30 15:12:37 INFO 
crawl.WebTableReader: retry 0: 2690 10/07/30 15:12:37 INFO crawl.WebTableReader: min score: 0.0 10/07/30 15:12:37 INFO crawl.WebTableReader: avg score: 0.7587361 10/07/30 15:12:37 INFO crawl.WebTableReader: max score: 1.0 10/07/30 15:12:37 INFO crawl.WebTableReader: status 0 (null): 649 10/07/30 15:12:37 INFO crawl.WebTableReader: status 2 (status_fetched): 1177 (SUCCESS=1177) 10/07/30 15:12:37 INFO crawl.WebTableReader: status 3 (status_gone): 112 10/07/30 15:12:37 INFO crawl.WebTableReader: status 34 (status_retry): 93 (EXCEPTION=93) 10/07/30 15:12:37 INFO crawl.WebTableReader: status 4 (status_redir_temp): 138 (TEMP_MOVED=138) 10/07/30 15:12:37 INFO crawl.WebTableReader: status 5 (status_redir_perm): 521 (MOVED=521) 10/07/30 15:12:37 INFO crawl.WebTableReader: WebTable statistics: done There should not be any entries with status 0 (null) I will investigate a bit more... -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
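The two-status idea from the comment above — the fetcher records only a provisional fetch status, and updatedb reconciles it with the previous db status into the final one — can be sketched like this. The enum values and reconciliation rules are simplified illustrations, not the actual Nutch 2.x schema (in particular, real reconciliation would also inspect the redirect target):

```java
public class StatusReconcile {
    public enum DbStatus { UNFETCHED, FETCHED, GONE, REDIR }
    public enum FetchStatus { NONE, SUCCESS, NOTFOUND, MOVED }

    // updatedb-side reconciliation; until it runs, the db status is never
    // overwritten, which avoids the "status 0 (null)" entries seen above.
    public static DbStatus reconcile(DbStatus old, FetchStatus fetch) {
        switch (fetch) {
            case SUCCESS:  return DbStatus.FETCHED;
            case MOVED:    return DbStatus.REDIR;
            case NOTFOUND: return DbStatus.GONE;
            default:       return old; // nothing fetched: keep previous status
        }
    }
}
```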
Re: [VOTE] Apache Nutch 1.2 Release Candidate #4
On 2010-09-24 04:38, Mattmann, Chris A (388J) wrote: Hi Nutch PMC: /nudge Anyone get a chance to review this yet? I have some free cycles tomorrow and would really think it’s cool if I could finally push out the 1.2 RC. I had little time this week, but I'm testing it now... I should be done tomorrow. -- Best regards, Andrzej Bialecki ___. ___ ___ ___ _ _ __ [__ || __|__/|__||\/| Information Retrieval, Semantic Web ___|||__|| \| || | Embedded Unix, System Integration http://www.sigram.com Contact: info at sigram dot com
Re: [VOTE] Apache Nutch 1.2 Release Candidate #4
On 2010-09-24 20:40, Mattmann, Chris A (388J) wrote: Thanks Andrzej, appreciate it. I know you’ve been really vigilant with the other RCs I’ve thrown up about testing and I appreciate it. Other Nutch PMC’ers: just need one more VOTE. Help, please? :) +1, all unit tests pass, and a test crawl + indexing to Solr went just fine. -- Best regards, Andrzej Bialecki ___. ___ ___ ___ _ _ __ [__ || __|__/|__||\/| Information Retrieval, Semantic Web ___|||__|| \| || | Embedded Unix, System Integration http://www.sigram.com Contact: info at sigram dot com
[jira] Commented: (NUTCH-880) REST API (and webapp) for Nutch
[ https://issues.apache.org/jira/browse/NUTCH-880?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12913118#action_12913118 ] Andrzej Bialecki commented on NUTCH-880: - bq. I think we can combine the approach you outlined in NUTCH-907 with this one. I'm not sure... they are really not the same things - you can execute many crawls with different seed lists, but still using the same Configuration. bq. What is CLASS ? It's the same as bin/nutch fully.qualified.class.name, only here I require that it implements NutchTool. bq. Btw, Andrzej, I will be happy to help out with the implementation if you want. By all means - I didn't have time so far to progress beyond this patch... REST API (and webapp) for Nutch --- Key: NUTCH-880 URL: https://issues.apache.org/jira/browse/NUTCH-880 Project: Nutch Issue Type: New Feature Affects Versions: 2.0 Reporter: Andrzej Bialecki Assignee: Andrzej Bialecki Attachments: API.patch This issue is for discussing a REST-style API for accessing Nutch. Here's an initial idea: * I propose to use org.restlet for handling requests and returning JSON/XML/whatever responses. * hook up all regular tools so that they can be driven via this API. This would have to be an async API, since all Nutch operations take a long time to execute. It follows then that we need to be able also to list running operations, retrieve their current status, and possibly abort/cancel/stop/suspend/resume/...? This also means that we would have to potentially create and manage many threads in a servlet - AFAIK this is frowned upon by J2EE purists... * package this in a webapp (that includes all deps, essentially nutch.job content), with the restlet servlet as an entry point. Open issues: * how to implement the reading of crawl results via this API * should we manage only crawls that use a single configuration per webapp, or should we have a notion of crawl contexts (sets of crawl configs) with CRUD ops on them? 
this would be nice, because it would allow managing of several different crawls, with different configs, in a single webapp - but it complicates the implementation a lot. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
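The async requirement in the proposal — submit returns immediately, and running operations can be listed, polled, and cancelled — can be sketched with an ExecutorService. Class and method names below are assumptions, not the API in the attached patch:

```java
import java.util.Map;
import java.util.UUID;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class JobRegistry {
    private final ExecutorService pool = Executors.newFixedThreadPool(4);
    private final Map<String, Future<?>> jobs = new ConcurrentHashMap<>();

    // Returns at once; the tool keeps running in the background,
    // so a REST handler can answer immediately with the job id.
    public String submit(Runnable tool) {
        String id = UUID.randomUUID().toString();
        jobs.put(id, pool.submit(tool));
        return id;
    }

    public String status(String id) {
        Future<?> f = jobs.get(id);
        if (f == null) return "UNKNOWN";
        if (f.isCancelled()) return "CANCELLED";
        return f.isDone() ? "DONE" : "RUNNING";
    }

    public boolean cancel(String id) {
        Future<?> f = jobs.get(id);
        return f != null && f.cancel(true);
    }

    public void shutdown() { pool.shutdownNow(); }
}
```

In a servlet container the pool would be created and torn down with the webapp lifecycle rather than owned ad hoc, which is exactly the thread-management concern the comment raises.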
[jira] Commented: (NUTCH-909) Add alternative search-provider to Nutch site
[ https://issues.apache.org/jira/browse/NUTCH-909?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12912474#action_12912474 ] Andrzej Bialecki commented on NUTCH-909: - bq. It might be better to see the message Search with Apache Solr (as on the TIKA's site). Yes, let's make this uniform. Add alternative search-provider to Nutch site - Key: NUTCH-909 URL: https://issues.apache.org/jira/browse/NUTCH-909 Project: Nutch Issue Type: Improvement Components: documentation Reporter: Alex Baranau Priority: Minor Attachments: NUTCH-909.patch Add additional search provider (to existed Lucid Find) search-lucene.com. Initiated in discussion: http://search-lucene.com/m/2suCr1UnDfF1 According to Andrzej's suggestion, when preparing the patch let's follow the same rationales as those in TIKA-488, since they are applicable here too, so please refer to that issue for more insight on implementation details. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Assigned: (NUTCH-862) HttpClient null pointer exception
[ https://issues.apache.org/jira/browse/NUTCH-862?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrzej Bialecki reassigned NUTCH-862: --- Assignee: Andrzej Bialecki HttpClient null pointer exception - Key: NUTCH-862 URL: https://issues.apache.org/jira/browse/NUTCH-862 Project: Nutch Issue Type: Bug Components: fetcher Affects Versions: 1.0.0 Environment: linux, java 6 Reporter: Sebastian Nagel Assignee: Andrzej Bialecki Priority: Minor Attachments: NUTCH-862.patch When re-fetching a document (a continued crawl) HttpClient throws an null pointer exception causing the document to be emptied: 2010-07-27 12:45:09,199 INFO fetcher.Fetcher - fetching http://localhost/doc/selfhtml/html/index.htm 2010-07-27 12:45:09,203 ERROR httpclient.Http - java.lang.NullPointerException 2010-07-27 12:45:09,204 ERROR httpclient.Http - at org.apache.nutch.protocol.httpclient.HttpResponse.init(HttpResponse.java:138) 2010-07-27 12:45:09,204 ERROR httpclient.Http - at org.apache.nutch.protocol.httpclient.Http.getResponse(Http.java:154) 2010-07-27 12:45:09,204 ERROR httpclient.Http - at org.apache.nutch.protocol.http.api.HttpBase.getProtocolOutput(HttpBase.java:220) 2010-07-27 12:45:09,204 ERROR httpclient.Http - at org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:537) 2010-07-27 12:45:09,204 INFO fetcher.Fetcher - fetch of http://localhost/doc/selfhtml/html/index.htm failed with: java.lang.NullPointerException Because the document is re-fetched the server answers 304 (not modified): 127.0.0.1 - - [27/Jul/2010:12:45:09 +0200] GET /doc/selfhtml/html/index.htm HTTP/1.0 304 174 - Nutch-1.0 No content is sent in this case (empty http body). 
Index: trunk/src/plugin/protocol-httpclient/src/java/org/apache/nutch/protocol/httpclient/HttpResponse.java
===================================================================
--- trunk/src/plugin/protocol-httpclient/src/java/org/apache/nutch/protocol/httpclient/HttpResponse.java (revision 979647)
+++ trunk/src/plugin/protocol-httpclient/src/java/org/apache/nutch/protocol/httpclient/HttpResponse.java (working copy)
@@ -134,7 +134,8 @@
       if (code == 200) throw new IOException(e.toString());
       // for codes other than 200 OK, we are fine with empty content
     } finally {
-      in.close();
+      if (in != null)
+        in.close();
       get.abort();
     }
-- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
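The null guard in the patch fixes the NPE; on Java 7+ the same cleanup can also be written with try-with-resources, which only invokes close() on a non-null resource. A generic sketch of the pattern, not the actual HttpResponse code:

```java
import java.io.IOException;
import java.io.InputStream;

public class SafeRead {
    // Counts the bytes in a stream that may be null (e.g. a 304 Not Modified
    // response with an empty body, as in the bug report above).
    public static int readAll(InputStream maybeNull) throws IOException {
        int total = 0;
        // try-with-resources skips close() when the resource is null,
        // so no explicit guard is needed in a finally block.
        try (InputStream in = maybeNull) {
            if (in == null) return 0;
            byte[] buf = new byte[4096];
            int n;
            while ((n = in.read(buf)) != -1) total += n;
        }
        return total;
    }
}
```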
[jira] Resolved: (NUTCH-906) Nutch OpenSearch sometimes raises DOMExceptions due to Lucene column names not being valid XML tag names
[ https://issues.apache.org/jira/browse/NUTCH-906?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrzej Bialecki resolved NUTCH-906. - Fix Version/s: 1.2 Resolution: Fixed Fixed in rev. 998261. Thanks! Nutch OpenSearch sometimes raises DOMExceptions due to Lucene column names not being valid XML tag names Key: NUTCH-906 URL: https://issues.apache.org/jira/browse/NUTCH-906 Project: Nutch Issue Type: Bug Components: web gui Affects Versions: 1.1 Environment: Debian GNU/Linux 64-bit Reporter: Asheesh Laroia Assignee: Andrzej Bialecki Fix For: 1.2 Attachments: 0001-OpenSearch-If-a-Lucene-column-name-begins-with-a-num.patch Original Estimate: 0.33h Remaining Estimate: 0.33h The Nutch FAQ explains that OpenSearch includes all fields that are available at search result time. However, some Lucene column names can start with numbers. Valid XML tags cannot. If Nutch is generating OpenSearch results for a document with a Lucene document column whose name starts with numbers, the underlying Xerces library throws this exception: org.w3c.dom.DOMException: INVALID_CHARACTER_ERR: An invalid or illegal XML character is specified. So I have written a patch that tests strings before they are used to generate tags within OpenSearch. I hope you merge this, or a better version of the patch! -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (NUTCH-907) DataStore API doesn't support multiple storage areas for multiple disjoint crawls
[ https://issues.apache.org/jira/browse/NUTCH-907?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12910109#action_12910109 ] Andrzej Bialecki commented on NUTCH-907: - That's very good news - in that case I'm fine with the Gora API as it is now, we should change Nutch to make use of this functionality. DataStore API doesn't support multiple storage areas for multiple disjoint crawls - Key: NUTCH-907 URL: https://issues.apache.org/jira/browse/NUTCH-907 Project: Nutch Issue Type: Bug Reporter: Andrzej Bialecki Fix For: 2.0 In Nutch 1.x it was possible to easily select a set of crawl data (crawldb, page data, linkdb, etc) by specifying a path where the data was stored. This enabled users to run several disjoint crawls with different configs, but still using the same storage medium, just under different paths. This is not possible now because there is a 1:1 mapping between a specific DataStore instance and a set of crawl data. In order to support this functionality the Gora API should be extended so that it can create stores (and data tables in the underlying storage) that use arbitrary prefixes to identify the particular crawl dataset. Then the Nutch API should be extended to allow passing this crawlId value to select one of possibly many existing crawl datasets. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (NUTCH-880) REST API (and webapp) for Nutch
[ https://issues.apache.org/jira/browse/NUTCH-880?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrzej Bialecki updated NUTCH-880: Attachment: API.patch Initial patch for discussion. This is a work in progress, so only some functionality is implemented, and even less than that is actually working ;) I would appreciate a review and comments. REST API (and webapp) for Nutch --- Key: NUTCH-880 URL: https://issues.apache.org/jira/browse/NUTCH-880 Project: Nutch Issue Type: New Feature Affects Versions: 2.0 Reporter: Andrzej Bialecki Assignee: Andrzej Bialecki Attachments: API.patch This issue is for discussing a REST-style API for accessing Nutch. Here's an initial idea: * I propose to use org.restlet for handling requests and returning JSON/XML/whatever responses. * hook up all regular tools so that they can be driven via this API. This would have to be an async API, since all Nutch operations take a long time to execute. It follows then that we need to be able also to list running operations, retrieve their current status, and possibly abort/cancel/stop/suspend/resume/...? This also means that we would have to potentially create and manage many threads in a servlet - AFAIK this is frowned upon by J2EE purists... * package this in a webapp (that includes all deps, essentially nutch.job content), with the restlet servlet as an entry point. Open issues: * how to implement the reading of crawl results via this API * should we manage only crawls that use a single configuration per webapp, or should we have a notion of crawl contexts (sets of crawl configs) with CRUD ops on them? this would be nice, because it would allow managing of several different crawls, with different configs, in a single webapp - but it complicates the implementation a lot. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Created: (NUTCH-907) DataStore API doesn't support multiple storage areas for multiple disjoint crawls
DataStore API doesn't support multiple storage areas for multiple disjoint crawls - Key: NUTCH-907 URL: https://issues.apache.org/jira/browse/NUTCH-907 Project: Nutch Issue Type: Bug Reporter: Andrzej Bialecki Fix For: 2.0 In Nutch 1.x it was possible to easily select a set of crawl data (crawldb, page data, linkdb, etc) by specifying a path where the data was stored. This enabled users to run several disjoint crawls with different configs, but still using the same storage medium, just under different paths. This is not possible now because there is a 1:1 mapping between a specific DataStore instance and a set of crawl data. In order to support this functionality the Gora API should be extended so that it can create stores (and data tables in the underlying storage) that use arbitrary prefixes to identify the particular crawl dataset. Then the Nutch API should be extended to allow passing this crawlId value to select one of possibly many existing crawl datasets. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (NUTCH-882) Design a Host table in GORA
[ https://issues.apache.org/jira/browse/NUTCH-882?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12909757#action_12909757 ] Andrzej Bialecki commented on NUTCH-882: - +1 to NutchContext. See also NUTCH-907 because the changes required in Gora API will likely make this task easier (once implemented ;) ). Design a Host table in GORA --- Key: NUTCH-882 URL: https://issues.apache.org/jira/browse/NUTCH-882 Project: Nutch Issue Type: New Feature Affects Versions: 2.0 Reporter: Julien Nioche Assignee: Julien Nioche Fix For: 2.0 Attachments: NUTCH-882-v1.patch Having a separate GORA table for storing information about hosts (and domains?) would be very useful for : * customising the behaviour of the fetching on a host basis e.g. number of threads, min time between threads etc... * storing stats * keeping metadata and possibly propagate them to the webpages * keeping a copy of the robots.txt and possibly use that later to filter the webtable * store sitemaps files and update the webtable accordingly I'll try to come up with a GORA schema for such a host table but any comments are of course already welcome -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.