Re: question about ObjectCache

2012-04-10 Thread Andrzej Bialecki

On 10/04/2012 05:00, Xiaolong Yang wrote:

Hi,all

I'm reading the source code of Nutch and I'm puzzled about
ObjectCache.java in the package org.apache.nutch.util. It seems to
provide only a little benefit when used in the URL normalizers and URL
filters. I have also read some discussion about caching in NUTCH-169
and NUTCH-501, but I can't understand it.

Can anyone tell me where ObjectCache is used to good benefit in
Nutch?


ObjectCache is designed to cache ready-to-use instances of Nutch 
plugins. The process of finding, instantiating and initializing plugins 
is inefficient, because it involves parsing plugin descriptors, 
initializing plugins, collecting the ones that implement correct 
extension points, etc.


It would kill performance if this process were invoked each time you 
want to run all plugins of a given type (e.g. URLNormalizer-s). The 
facades URLNormalizers, URLFilters and others make sure that plugin 
instances of a given type are initialized once per lifetime of a JVM 
and then cached in ObjectCache, so that the next time you want to use 
them they can be retrieved from the cache instead of going through the 
parsing/instantiating/initializing process again.
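To illustrate the idea, here is a minimal, hypothetical sketch of such a per-configuration cache. This is not the real org.apache.nutch.util.ObjectCache, only the pattern it implements: expensive-to-build objects are created once per configuration and then looked up by key.

```java
import java.util.Map;
import java.util.WeakHashMap;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch of a per-configuration object cache (not the real
// org.apache.nutch.util.ObjectCache). Each configuration object gets its
// own cache; initialized plugin instances are stored under a string key,
// so the expensive find/instantiate/initialize step runs once per JVM.
class ObjectCacheSketch {
  private static final Map<Object, ObjectCacheSketch> CACHES = new WeakHashMap<>();
  private final Map<String, Object> objects = new ConcurrentHashMap<>();

  // One cache per configuration; the WeakHashMap lets a cache be collected
  // together with its configuration object.
  public static synchronized ObjectCacheSketch get(Object conf) {
    return CACHES.computeIfAbsent(conf, c -> new ObjectCacheSketch());
  }

  public Object getObject(String key) { return objects.get(key); }

  public void setObject(String key, Object value) { objects.put(key, value); }
}
```

A facade like URLNormalizers would first try getObject(key) and only fall back to the expensive plugin-repository lookup on a cache miss.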


--
Best regards,
Andrzej Bialecki 
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: Drawing an analogy between AdaptiveFetchSchedule and AdaptiveCrawlDelay

2012-03-02 Thread Andrzej Bialecki

On 02/03/2012 12:45, Lewis John Mcgibbney wrote:

Hi Guys,

As there were some comments on the user list, I recently got digging
into HTTP redirects and stumbled across NUTCH-1042. Although redirects
and crawl delays are individual issues, I think they are certainly
linked. What is interesting is that users usually don't consider them
to be interlinked, and therefore struggle to debug how and why either
the redirected or the crawl-delayed pages are not being fetched.

Doing some more digging I found the now rather old and tatty NUTCH-475,
which got me thinking about how we maintain the AdaptiveFetchSchedule
for custom refetching. I'm now starting to think about the following:

- Regardless of whether we implement an AdaptiveCrawlDelay, NUTCH-1042
still needs to be fixed, as this is obviously becoming a bit of a pain
for some users.


Yes.


- Can someone shed some light on what happened to the Fetcher2.java that
Dogacan refers to? I was only ever accustomed to OldFetcher and Fetcher :0)


Fetcher2 is the current Fetcher. The original Fetcher was temporarily 
renamed OldFetcher and then removed.





[jira] [Commented] (NUTCH-1201) Allow for different FetcherThread impls

2012-01-17 Thread Andrzej Bialecki (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1201?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13187927#comment-13187927
 ] 

Andrzej Bialecki  commented on NUTCH-1201:
--

I agree that there are situations where you might want a custom fetcher (e.g. 
depth-first crawling), and it would be good to come up with some more specific 
API than just MapRunner.

I'm not convinced yet that providing interfaces (or rather abstract classes) 
for the existing plumbing in Fetcher is a good idea - let's figure out first 
whether this code is reusable at all for some other fetching strategies, 
because if it's not then providing custom queue impls. may offer little value, 
and perhaps customization should be implemented on a different level.

Re. thread spinning - I haven't yet seen an unequivocal case proving 
that crawl contention is caused by the thread management in Fetcher. 
Usually, on closer look, the bottleneck turned out to lie elsewhere 
(network I/O, remote throttling, DNS lookups, politeness rules, etc.).

 Allow for different FetcherThread impls
 ---

 Key: NUTCH-1201
 URL: https://issues.apache.org/jira/browse/NUTCH-1201
 Project: Nutch
  Issue Type: New Feature
  Components: fetcher
Reporter: Markus Jelsma
Assignee: Markus Jelsma
 Fix For: 1.5


 For certain cases we need to modify parts of FetcherThread and make it 
 pluggable. This introduces a new config directive, fetcher.impl, that takes a 
 FQCN; Fetcher.fetch uses that setting to load the class passed to 
 job.setMapRunnerClass(). This new class has to extend Fetcher and an inner 
 class FetcherThread. This allows for overriding methods in FetcherThread, but 
 also methods in Fetcher itself if required.
 A follow-up on this issue would be to refactor parts of FetcherThread to make 
 it easier to override small sections instead of copying the entire method 
 body for a small change, which is now the case.
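A minimal sketch of the FQCN-loading step described above (the names here are illustrative; the actual patch wires the loaded class into job.setMapRunnerClass()):

```java
// Illustrative sketch of loading a fetcher implementation by its fully
// qualified class name, as the proposed fetcher.impl directive would do.
// The base-class check guards against configuring an unrelated class.
class ImplLoader {
  public static Class<?> load(String fqcn, Class<?> mustExtend)
      throws ClassNotFoundException {
    Class<?> c = Class.forName(fqcn);
    if (!mustExtend.isAssignableFrom(c)) {
      throw new IllegalArgumentException(
          fqcn + " does not extend " + mustExtend.getName());
    }
    return c;
  }
}
```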

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (NUTCH-1247) CrawlDatum.retries should be int

2012-01-14 Thread Andrzej Bialecki (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1247?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13186212#comment-13186212
 ] 

Andrzej Bialecki  commented on NUTCH-1247:
--

Indeed, line 264 increases the retry counter, but after it reaches retryMax 
then page status is set to DB_GONE, so it won't be generated again until it 
expires, and its retry counter won't increase. Once it expires then Generator 
should invoke FetchSchedule.forceRefetch on this page, and the default 
implementation resets the retry counter. So either there's some bug in this 
cycle, or your retryMax is greater than 127.
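The wrap-around behind those negative retry counts can be seen with a two-line sketch: a Java byte overflows past 127, so a counter stored as a byte goes negative once retryMax allows it that high.

```java
// Demonstrates why a byte-sized retry counter "goes bad with larger
// values": incrementing past Byte.MAX_VALUE wraps around to negative.
class RetryOverflow {
  public static byte increment(byte retries) {
    return (byte) (retries + 1); // arithmetic promotes to int, the cast wraps
  }
}
```

increment((byte) 127) yields -128, matching the "retry -128" lines in the CrawlDbReader output.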

 CrawlDatum.retries should be int
 

 Key: NUTCH-1247
 URL: https://issues.apache.org/jira/browse/NUTCH-1247
 Project: Nutch
  Issue Type: Bug
Affects Versions: 1.4
Reporter: Markus Jelsma
 Fix For: 1.5


 CrawlDatum.retries is a byte and goes bad with larger values.
 12/01/12 18:35:22 INFO crawl.CrawlDbReader: retry -127: 1
 12/01/12 18:35:22 INFO crawl.CrawlDbReader: retry -128: 1





[jira] [Commented] (NUTCH-1247) CrawlDatum.retries should be int

2012-01-13 Thread Andrzej Bialecki (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1247?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13185908#comment-13185908
 ] 

Andrzej Bialecki  commented on NUTCH-1247:
--

Originally the reason for a byte was compactness, but we can get the same 
effect using vint.

Markus, something seems off in your setup if you get such high values of 
retries ... usually CrawlDbReducer will set STATUS_DB_GONE if the number of 
retries reaches db.fetch.retry.max, so the page will not be tried again until 
FetchSchedule.forceRefetch resets its status (and the number of retries).
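The "same effect using vint" remark can be illustrated with a simple variable-length integer encoding. This is a generic LEB128-style sketch, not Hadoop's exact WritableUtils format: small values still take a single byte, but large counts no longer overflow.

```java
import java.io.ByteArrayOutputStream;

// Generic variable-length int encoding sketch: 7 data bits per byte,
// high bit marks continuation. Small values stay as compact as a byte
// field, but the full int range is representable.
class VarInt {
  public static byte[] encode(int value) {
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    int v = value;
    do {
      int b = v & 0x7f;
      v >>>= 7;
      out.write(v != 0 ? (b | 0x80) : b); // set high bit if more bytes follow
    } while (v != 0);
    return out.toByteArray();
  }

  public static int decode(byte[] bytes) {
    int result = 0, shift = 0;
    for (byte b : bytes) {
      result |= (b & 0x7f) << shift;
      shift += 7;
      if ((b & 0x80) == 0) break; // last byte reached
    }
    return result;
  }
}
```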

 CrawlDatum.retries should be int
 

 Key: NUTCH-1247
 URL: https://issues.apache.org/jira/browse/NUTCH-1247
 Project: Nutch
  Issue Type: Bug
Affects Versions: 1.4
Reporter: Markus Jelsma
 Fix For: 1.5


 CrawlDatum.retries is a byte and goes bad with larger values.
 12/01/12 18:35:22 INFO crawl.CrawlDbReader: retry -127: 1
 12/01/12 18:35:22 INFO crawl.CrawlDbReader: retry -128: 1





Re: Build failed in Jenkins: Nutch-trunk #1706

2011-12-28 Thread Andrzej Bialecki

On 28/12/2011 12:00, Lewis John Mcgibbney wrote:

Hi Guys,

Pretty strange compilation failure: this test class hasn't been touched
in months, and on the surface, having looked at the test case, there
appears to be no obvious reason for it failing to compile. I've
kick-started another build on Jenkins to see if it will resolve itself.


I don't think it will - I can reproduce this failure locally. Here's 
what fixed the failure for me (I'm pretty ignorant about ivy/maven so 
there's likely a more correct fix for this):


Index: ivy/ivy.xml
===================================================================
--- ivy/ivy.xml (revision 1225046)
+++ ivy/ivy.xml (working copy)
@@ -69,7 +69,7 @@
 <!--Configuration: test -->

 <!--artifacts needed for testing -->
-<dependency org="junit" name="junit" rev="3.8.1" conf="test->default" />
+<dependency org="junit" name="junit" rev="3.8.1" conf="*->default" />
 <dependency org="org.apache.hadoop" name="hadoop-test" rev="0.20.205.0" conf="test->default" />





Re: Build failed in Jenkins: Nutch-trunk #1706

2011-12-28 Thread Andrzej Bialecki

On 28/12/2011 14:15, Lewis John Mcgibbney wrote:

Hi Andrzej,

Can anyone confirm? I've tried this patch locally and although I
couldn't reproduce the original issue, it seems to be working fine for
me as well.


Check your lib/ dir - maybe you have a local copy of the junit jar that 
gets pulled onto the classpath and masks the issue? This happened to me 
once or twice...






Re: [jira] [Created] (NUTCH-1225) Migrate CrawlDBScanner to MapReduce API

2011-12-15 Thread Andrzej Bialecki

On 15/12/2011 13:13, Markus Jelsma wrote:

Hmm, I don't see how I can use the old mapred MapFileOutputFormat API with the new
Job API. job.setOutputFormatClass(MapFileOutputFormat.class) expects the
mapreduce.lib.output.MapFileOutputFormat class and won't accept the old API:

setOutputFormatClass(java.lang.Class<? extends
org.apache.hadoop.mapreduce.OutputFormat>) in org.apache.hadoop.mapreduce.Job
cannot be applied to
(java.lang.Class<org.apache.hadoop.mapred.MapFileOutputFormat>)

In short, I don't know how I can migrate jobs to the new API on 0.20.x without
having MapFileOutputFormat present in the new API. Trying to set the old
MapFileOutputFormat ...


Ah, no, that's not what I meant ... of course you need to change the 
code to use the new API, and the new code will look quite different :) 
My point was only that it is different in a consistent way, so after 
you've ported one or two classes the other ones are easy to convert, too...


I'm bogged with other work now, but I'll see if I can prepare an example 
later today...





Re: [jira] [Created] (NUTCH-1225) Migrate CrawlDBScanner to MapReduce API

2011-12-14 Thread Andrzej Bialecki

On 14/12/2011 16:01, Markus Jelsma wrote:

This is highly annoying, MapFileOutputFormat is not present in the MapReduce
API until 0.21!


AFAIK that's not the case ... there is both an old API and a new API 
implementation (the old one is deprecated). The new API is in 
org.apache.hadoop.mapreduce.lib.output .





Re: [jira] [Created] (NUTCH-1225) Migrate CrawlDBScanner to MapReduce API

2011-12-14 Thread Andrzej Bialecki

On 14/12/2011 18:30, Markus Jelsma wrote:

proper link:

http://hadoop.apache.org/common/docs/r0.20.205.0/api/org/apache/hadoop/mapreduce/lib/output/package-summary.html


I thought the goal was to upgrade to 0.22, where this class is present. 
In 0.20.205 org.apache.hadoop.mapred.MapFileOutputFormat still uses the 
old api, and it's not deprecated yet.






Re: Upgrading to Hadoop 0.22.0+

2011-12-13 Thread Andrzej Bialecki

On 13/12/2011 17:42, Lewis John Mcgibbney wrote:

Hi Markus,

I'm certainly in agreement here. If you like to open a Jira, we can
begin the build up a picture of what is required.

Lewis

On Tue, Dec 13, 2011 at 4:41 PM, Markus Jelsma
markus.jel...@openindex.io  wrote:

Hi,

To keep up with the rest of the world I believe we should move from the old
Hadoop mapred API to the new MapReduce API, which has already been done for
the nutchgora branch. Upgrading from hadoop-core to hadoop-common is easily
done in Ivy, but all jobs must be tackled and we have many jobs!

Anyone willing to give pointers and a helping hand in this large task?


I guess the question is also whether 0.22 is compatible enough to 
compile more or less with the existing code that uses the old API. If it 
does, then we can do the transition gradually; if it doesn't, then it's 
a bigger issue.


This is easy to verify - just drop in the 0.22 jars and see if it 
compiles / tests are passing.





Re: Upgrading to Hadoop 0.22.0+

2011-12-13 Thread Andrzej Bialecki

On 13/12/2011 18:04, Markus Jelsma wrote:

Hi

I did a quick test to see what happens, and it won't compile: it cannot find
our old mapred APIs in 0.22. I've also tried 0.20.205.0, which compiles but
won't run, and many tests fail with stuff like:

Exception in thread main java.lang.NoClassDefFoundError:
org/codehaus/jackson/map/JsonMappingException
 at
org.apache.nutch.util.dupedb.HostDeduplicator.deduplicator(HostDeduplicator.java:421)


Hmm... what's that? I don't see this class (or this package) in the 
Nutch tree. Also, trunk doesn't use JSON for anything as far as I know.



 at
org.apache.nutch.util.dupedb.HostDeduplicator.run(HostDeduplicator.java:443)
 at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
 at
org.apache.nutch.util.dupedb.HostDeduplicator.main(HostDeduplicator.java:431)
Caused by: java.lang.ClassNotFoundException:
org.codehaus.jackson.map.JsonMappingException
 at java.net.URLClassLoader$1.run(URLClassLoader.java:202)
 at java.security.AccessController.doPrivileged(Native Method)
 at java.net.URLClassLoader.findClass(URLClassLoader.java:190)
 at java.lang.ClassLoader.loadClass(ClassLoader.java:306)
 at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:301)
 at java.lang.ClassLoader.loadClass(ClassLoader.java:247)
 ... 4 more

I think this can be overcome, but we cannot hide from the fact that all jobs
must be ported to the new API at some point.

You did some work on the new APIs - did you come across any cumbersome issues
when working on them?


It was quite some time ago .. but I don't remember anything being really 
complicated, it was just tedious - and once you've done one class the 
other classes follow roughly the same pattern.






[jira] [Resolved] (NUTCH-1213) Pass additional SolrParams when indexing to Solr

2011-11-28 Thread Andrzej Bialecki (Resolved) (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1213?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrzej Bialecki  resolved NUTCH-1213.
--

Resolution: Fixed

Committed in rev. 1207217, thanks for the review.

 Pass additional SolrParams when indexing to Solr
 

 Key: NUTCH-1213
 URL: https://issues.apache.org/jira/browse/NUTCH-1213
 Project: Nutch
  Issue Type: Improvement
  Components: indexer
Reporter: Andrzej Bialecki 
Assignee: Andrzej Bialecki 
 Attachments: NUTCH-1213.diff


 This is a simple improvement of the SolrIndexer. It adds the ability to pass 
 additional Solr parameters that are applied to each UpdateRequest. This is 
 useful when you have to pass parameters specific to a particular indexing 
 run, which are not in Solr invariants for the update handler, and modifying 
 the Solr configuration for each different indexing run is inconvenient.





[jira] [Updated] (NUTCH-1213) Pass additional SolrParams when indexing to Solr

2011-11-25 Thread Andrzej Bialecki (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1213?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrzej Bialecki  updated NUTCH-1213:
-

Attachment: NUTCH-1213.diff

Patch that implements this functionality. SolrParams can be passed as a 
URL-like string, for example:
{code}
nutch solrindex http://localhost:8983/solr/collection1 db -linkdb linkdb 
-params update.chain=distrib&fmap.a=links segments/2025105233
{code}
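Assuming the archive stripped an '&' from the example above (i.e. the params string is update.chain=distrib&fmap.a=links), here is a hypothetical sketch of parsing such a URL-like parameter string; the real SolrIndexer code may split it differently.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch: split a URL-query-like "-params" argument such as
// "update.chain=distrib&fmap.a=links" into key/value pairs that could be
// applied to each Solr update request.
class ParamsParser {
  public static Map<String, String> parse(String params) {
    Map<String, String> result = new LinkedHashMap<>();
    for (String pair : params.split("&")) {
      int eq = pair.indexOf('=');
      if (eq < 0) {
        result.put(pair, ""); // bare key without a value
      } else {
        result.put(pair.substring(0, eq), pair.substring(eq + 1));
      }
    }
    return result;
  }
}
```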

 Pass additional SolrParams when indexing to Solr
 

 Key: NUTCH-1213
 URL: https://issues.apache.org/jira/browse/NUTCH-1213
 Project: Nutch
  Issue Type: Improvement
  Components: indexer
Reporter: Andrzej Bialecki 
Assignee: Andrzej Bialecki 
 Attachments: NUTCH-1213.diff


 This is a simple improvement of the SolrIndexer. It adds the ability to pass 
 additional Solr parameters that are applied to each UpdateRequest. This is 
 useful when you have to pass parameters specific to a particular indexing 
 run, which are not in Solr invariants for the update handler, and modifying 
 the Solr configuration for each different indexing run is inconvenient.





[jira] [Issue Comment Edited] (NUTCH-1213) Pass additional SolrParams when indexing to Solr

2011-11-25 Thread Andrzej Bialecki (Issue Comment Edited) (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1213?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13157077#comment-13157077
 ] 

Andrzej Bialecki  edited comment on NUTCH-1213 at 11/25/11 10:26 AM:
-

Patch that implements this functionality. SolrParams can be passed as a 
URL-like string, for example:
{code}
nutch solrindex http://localhost:8983/solr/collection1 db -linkdb linkdb 
-params update.chain=distrib&fmap.a=links segments/2025105233
{code}

  was (Author: ab):
Path that implements this functionality. SolrParams can be passed as an 
URL-like string, for example:
{code}
nutch solrindex http://localhost:8983/solr/collection1 db -linkdb linkdb 
-params update.chain=distrib&fmap.a=links segments/2025105233
{code}
  
 Pass additional SolrParams when indexing to Solr
 

 Key: NUTCH-1213
 URL: https://issues.apache.org/jira/browse/NUTCH-1213
 Project: Nutch
  Issue Type: Improvement
  Components: indexer
Reporter: Andrzej Bialecki 
Assignee: Andrzej Bialecki 
 Attachments: NUTCH-1213.diff


 This is a simple improvement of the SolrIndexer. It adds the ability to pass 
 additional Solr parameters that are applied to each UpdateRequest. This is 
 useful when you have to pass parameters specific to a particular indexing 
 run, which are not in Solr invariants for the update handler, and modifying 
 the Solr configuration for each different indexing run is inconvenient.





Re: Dependency Injection

2011-11-23 Thread Andrzej Bialecki

On 23/11/2011 01:02, Andrzej Bialecki wrote:

On 22/11/2011 19:47, PJ Herring wrote:

Hey Chris,

Thanks for the response. I looked at the documents you sent me, and I
really do think incorporating some kind of DI Framework could be a great
addition to Nutch.

I have a general plan of attack, but I'll try to write that up more
formally and send it out to get some kind of feedback.


This sounds interesting. As Chris mentioned, the current plugin system
is far from ideal, but so far it worked reasonably well. The key
functionality that it implements is:

* self-discovery of services provided by each plugin,
* easy pluggability, by the virtue of dropping super-jars (jars with
impl. classes and nested library jars) to a predefined location,
* controlled classloader isolation between plugins so that incompatible
versions of libraries can be used
* but also ability to export specified classes and libraries so that one
plugin can use other plugin's exported resources on its classpath.
* optional auto-loading of dependent plugins

In the past one contributor made a bold attempt to port Nutch to OSGI,
and it turned out to be much more complicated than we expected, and with
a bigger impact on the way Nutch applications were supposed to run ...
so at that time we didn't think this complication was justified.

If we can figure out something between full-blown OSGI and the current
system then that would be great.



You may also want to take a look at JSPF (http://code.google.com/p/jspf) 
which perhaps could be made to satisfy the above requirements without 
too much refactoring.






Re: Dependency Injection

2011-11-22 Thread Andrzej Bialecki

On 22/11/2011 19:47, PJ Herring wrote:

Hey Chris,

Thanks for the response. I looked at the documents you sent me, and I
really do think incorporating some kind of DI Framework could be a great
addition to Nutch.

I have a general plan of attack, but I'll try to write that up more
formally and send it out to get some kind of feedback.


This sounds interesting. As Chris mentioned, the current plugin system 
is far from ideal, but so far it worked reasonably well. The key 
functionality that it implements is:


* self-discovery of services provided by each plugin,
* easy pluggability, by the virtue of dropping super-jars (jars with 
impl. classes and nested library jars) to a predefined location,
* controlled classloader isolation between plugins so that incompatible 
versions of libraries can be used
* but also ability to export specified classes and libraries so that one 
plugin can use other plugin's exported resources on its classpath.

* optional auto-loading of dependent plugins

In the past one contributor made a bold attempt to port Nutch to OSGI, 
and it turned out to be much more complicated than we expected, and with 
a bigger impact on the way Nutch applications were supposed to run ... 
so at that time we didn't think this complication was justified.


If we can figure out something between full-blown OSGI and the current 
system then that would be great.
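The "controlled classloader isolation" point above can be sketched with a child-first classloader. This is only an illustration of the idea, not Nutch's actual plugin classloader:

```java
import java.net.URL;
import java.net.URLClassLoader;

// Child-first delegation sketch: classes in the plugin's own jars win
// over the parent classpath, so plugins can bundle incompatible library
// versions. Core java.* classes are always delegated to the parent.
class ChildFirstClassLoader extends URLClassLoader {
  public ChildFirstClassLoader(URL[] urls, ClassLoader parent) {
    super(urls, parent);
  }

  @Override
  protected Class<?> loadClass(String name, boolean resolve)
      throws ClassNotFoundException {
    synchronized (getClassLoadingLock(name)) {
      Class<?> c = findLoadedClass(name);
      if (c == null && !name.startsWith("java.")) {
        try {
          c = findClass(name); // try the plugin's own jars first
        } catch (ClassNotFoundException ignored) {
          // fall through to the parent
        }
      }
      if (c == null) {
        c = super.loadClass(name, false); // normal parent-first fallback
      }
      if (resolve) {
        resolveClass(c);
      }
      return c;
    }
  }
}
```

Exporting classes to other plugins would then amount to putting one plugin's loader (or its jars) on another plugin's delegation path.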





Re: Signature == null ?

2011-11-15 Thread Andrzej Bialecki

On 15/11/2011 20:33, Markus Jelsma wrote:

It's back again! Last try if someone has a pointer for this.
Cheers


After some DB updates, they're gone! Does anyone recognize this phenomenon?

On Tuesday 08 November 2011 11:22:48 Markus Jelsma wrote:

On Tuesday 08 November 2011 11:15:37 Markus Jelsma wrote:

Hi guys,

I've got an M/R job selecting only DB_FETCHED and DB_NOTMODIFIED records and
their signatures. I had to add a sanity check on signature to avoid an
NPE. I had the assumption that any record with such a DB_ status has to have a
signature, right?

Why does roughly 0.0001625% of my records exist without a signature?


Now with correct metrics:
Why does roughly 0.84% of my records exist without a signature?


This could be somehow related to pages that come from redirects so that 
when they are fetched they are accounted for under different urls, which 
in turn may confuse the update code in CrawlDbReducer... Do you notice 
any pattern to these pages? What's their origin?





Re: Persistent problems with Ivy dependencies in Eclipse

2011-11-10 Thread Andrzej Bialecki

On 10/11/2011 04:39, Lewis John Mcgibbney wrote:

Gets even more strange, both SWFParser and AutomationURLFilter import
additonal depenedencies, however they are not included within thier
plugin/ivy/ivy.xml files!

Am I missing something here?


Most likely these problems come from the initial porting of a pure ant 
build to an ant+ivy build. We should determine what deps are really 
needed by these plugins, and sanitize the ivy.xml files so that they 
make sense - if the existing files can't be untangled we can ditch them 
and come up with new, clean ones.





[jira] [Commented] (NUTCH-1139) Indexer to delete documents

2011-11-10 Thread Andrzej Bialecki (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1139?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13147722#comment-13147722
 ] 

Andrzej Bialecki  commented on NUTCH-1139:
--

I suggest renaming the option to -deleteGone, to make it more obvious what it's 
supposed to do.

 Indexer to delete documents
 ---

 Key: NUTCH-1139
 URL: https://issues.apache.org/jira/browse/NUTCH-1139
 Project: Nutch
  Issue Type: Improvement
  Components: indexer
Reporter: Markus Jelsma
Assignee: Markus Jelsma
 Fix For: 1.5

 Attachments: NUTCH-1139-1.4-1.patch


 Add an option -delete to the solrindex command. With this feature enabled 
 documents of the currently processing segment with status FETCH_GONE or 
 FETCH_REDIR_PERM are deleted, a following SolrClean is not required anymore.
 This issue is a follow up of NUTCH-1052.





[jira] [Commented] (NUTCH-1061) Migrate MoreIndexingFilter from Apache ORO to java.util.regex

2011-11-10 Thread Andrzej Bialecki (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1061?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13147723#comment-13147723
 ] 

Andrzej Bialecki  commented on NUTCH-1061:
--

+1.

 Migrate MoreIndexingFilter from Apache ORO to java.util.regex
 -

 Key: NUTCH-1061
 URL: https://issues.apache.org/jira/browse/NUTCH-1061
 Project: Nutch
  Issue Type: Improvement
Reporter: Markus Jelsma
Assignee: Markus Jelsma
 Fix For: 1.5

 Attachments: NUTCH-1061-1.4-1.patch


 Here's a patch migrating the resetTitle method away from Apache ORO to 
 java.util.regex. There was no unit test for this method so I added one. The 
 test passes with the old Apache ORO impl. and with the new j.u.regex impl.
 Please comment.





Re: Nutch Maven artifacts now published as polled/nightly SNAPSHOTS

2011-11-05 Thread Andrzej Bialecki

On 05/11/2011 06:44, Mattmann, Chris A (388J) wrote:

Hey Guys,

I modified the Jenkins jobs that Lewis set up to now:

* poll SCM hourly for changes to Nutch
* publish Maven snapshots (1.5-SNAPSHOT) and above of Nutch
to repository.apache.org


Very useful - thanks a lot!




[jira] [Commented] (NUTCH-1196) Update job should impose an upper limit on the number of inlinks (nutchgora)

2011-11-04 Thread Andrzej Bialecki (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1196?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13144226#comment-13144226
 ] 

Andrzej Bialecki  commented on NUTCH-1196:
--

Very nicely done and useful patch! A few cosmetic comments:

* a common pattern in Hadoop is to reuse object instances as much as possible, 
so any places where you use the new operator should be reviewed (e.g. new 
NutchWritable(...)).
* in UrlScoreComparator.compare(o1, o2) you can just use unary minus instead of 
multiplication by -1.
* in DbUpdateMapper you can assign a score of Float.MAX_VALUE to the web page 
record, this way in DbUpdateReducer.reduce you won't have to iterate over all 
entries, because the web page record will always come as the first, and we can 
save some time by skipping the remaining entries. Unless you really want to 
tally the number of skipped inlinks.

Overall the patch looks good, +1 for committing.

 Update job should impose an upper limit on the number of inlinks (nutchgora)
 

 Key: NUTCH-1196
 URL: https://issues.apache.org/jira/browse/NUTCH-1196
 Project: Nutch
  Issue Type: Bug
Reporter: Ferdy Galema
 Fix For: nutchgora

 Attachments: NUTCH-1196.patch


 Currently the nutchgora branch does not limit the number of inlinks in the 
 update job. This will result in some nasty out-of-memory exceptions and 
 timeouts when the crawl is getting big. Nutch trunk already has a default 
 limit of 10,000 inlinks. I will implement this in nutchgora too. Nutch trunk 
 uses a sorting mechanism in the reducer itself, but I will implement it using 
 standard Hadoop components instead (should be a bit faster). This means:
 The keys of the reducer will be a {url,score} tuple.
 *Partitioning* will be done by {url}.
 *Sorting* will be done by {url,score}.
 Finally *grouping* will be done by {url} again.
 This ensures all identical urls will be put in the same reducer, but in 
 order of scoring.
 Patch should be ready by tomorrow. Please let me know when you have any 
 comments or suggestions.
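The {url,score} composite-key ordering described above can be sketched with a plain comparator (illustrative only; the actual patch implements it as Hadoop partitioner/comparator classes):

```java
import java.util.Comparator;

// Sketch of the {url,score} sort order: ascending by url, then
// descending by score, so after grouping by url the highest-scoring
// entry for each url reaches the reducer first.
class UrlScoreKey {
  final String url;
  final float score;

  UrlScoreKey(String url, float score) {
    this.url = url;
    this.score = score;
  }

  static final Comparator<UrlScoreKey> SORT = (a, b) -> {
    int c = a.url.compareTo(b.url);
    return c != 0 ? c : Float.compare(b.score, a.score); // higher score first
  };
}
```

Partitioning and grouping would compare only the url component, so all entries for one url land in the same reduce call, already ordered by score.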





[jira] [Resolved] (NUTCH-1195) Add Solr 4x (trunk) example schema

2011-11-03 Thread Andrzej Bialecki (Resolved) (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1195?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrzej Bialecki  resolved NUTCH-1195.
--

Resolution: Fixed

Committed in rev. 1197319.

 Add Solr 4x (trunk) example schema
 --

 Key: NUTCH-1195
 URL: https://issues.apache.org/jira/browse/NUTCH-1195
 Project: Nutch
  Issue Type: Improvement
  Components: indexer
Reporter: Andrzej Bialecki 
Assignee: Andrzej Bialecki 
 Fix For: 1.4

 Attachments: schema-solr4.xml


 The conf/schema.xml that we ship works ok for Solr 3.x, but in Solr trunk 
 some of the class names have been changed, and some field types have been 
 redefined, so if you simply drop this schema into Solr it will cause severe 
 errors and indexing won't work.
 I propose to add a version of the schema.xml file that is tailored to Solr 
 4.x so that users can deploy this schema when indexing to Solr trunk.





[jira] [Created] (NUTCH-1197) Add statically configured field values to solrindex-mapping.xml

2011-11-03 Thread Andrzej Bialecki (Created) (JIRA)
Add statically configured field values to solrindex-mapping.xml
---

 Key: NUTCH-1197
 URL: https://issues.apache.org/jira/browse/NUTCH-1197
 Project: Nutch
  Issue Type: Improvement
  Components: indexer
Reporter: Andrzej Bialecki 
Assignee: Andrzej Bialecki 
 Fix For: 1.4


In some cases it's useful to be able to add to every document sent to Solr a 
set of predefined fields with static values. This could be implemented on the 
Solr side (with a custom UpdateRequestProcessor), but it may be less cumbersome 
to add them on the Nutch side.

Example: let's say I have several Nutch configurations all indexing to the same 
Solr instance, and I want each of them to add its identifier as a field in all 
documents, e.g. origin=web_crawl_1, origin=file_crawl, 
origin=unlimited_crawl, etc...
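For illustration, a static field in conf/solrindex-mapping.xml could look roughly like the sketch below; the value attribute is an assumed syntax for the proposed feature, not necessarily what the committed patch uses:

```xml
<mapping>
  <fields>
    <!-- existing dynamic mapping: copy a parsed field to a Solr field -->
    <field dest="content" source="content"/>
    <!-- assumed static-value syntax: attach origin=web_crawl_1 to every doc -->
    <field dest="origin" value="web_crawl_1"/>
  </fields>
</mapping>
```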





[jira] [Updated] (NUTCH-1197) Add statically configured field values to solrindex-mapping.xml

2011-11-03 Thread Andrzej Bialecki (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1197?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrzej Bialecki  updated NUTCH-1197:
-

Attachment: NUTCH-1197.patch

Patch with the implementation. I added some javadocs, and a unit test for both 
the old and the new functionality.

 Add statically configured field values to solrindex-mapping.xml
 ---

 Key: NUTCH-1197
 URL: https://issues.apache.org/jira/browse/NUTCH-1197
 Project: Nutch
  Issue Type: Improvement
  Components: indexer
Reporter: Andrzej Bialecki 
Assignee: Andrzej Bialecki 
 Fix For: 1.4

 Attachments: NUTCH-1197.patch


 In some cases it's useful to be able to add to every document sent to Solr a 
 set of predefined fields with static values. This could be implemented on the 
 Solr side (with a custom UpdateRequestProcessor), but it may be less 
 cumbersome to add them on the Nutch side.
 Example: let's say I have several Nutch configurations all indexing to the 
 same Solr instance, and I want each of them to add its identifier as a field 
 in all documents, e.g. origin=web_crawl_1, origin=file_crawl, 
 origin=unlimited_crawl, etc...





[jira] [Created] (NUTCH-1195) Add Solr 4x (trunk) example schema

2011-11-02 Thread Andrzej Bialecki (Created) (JIRA)
Add Solr 4x (trunk) example schema
--

 Key: NUTCH-1195
 URL: https://issues.apache.org/jira/browse/NUTCH-1195
 Project: Nutch
  Issue Type: Improvement
  Components: indexer
Reporter: Andrzej Bialecki 
Assignee: Andrzej Bialecki 
 Fix For: 1.4


The conf/schema.xml that we ship works ok for Solr 3.x, but in Solr trunk some 
of the class names have been changed, and some field types have been redefined, 
so if you simply drop this schema into Solr it will cause severe errors and 
indexing won't work.

I propose to add a version of the schema.xml file that is tailored to Solr 4.x 
so that users can deploy this schema when indexing to Solr trunk.





[jira] [Updated] (NUTCH-1195) Add Solr 4x (trunk) example schema

2011-11-02 Thread Andrzej Bialecki (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1195?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrzej Bialecki  updated NUTCH-1195:
-

Attachment: schema-solr4.xml

 Add Solr 4x (trunk) example schema
 --

 Key: NUTCH-1195
 URL: https://issues.apache.org/jira/browse/NUTCH-1195
 Project: Nutch
  Issue Type: Improvement
  Components: indexer
Reporter: Andrzej Bialecki 
Assignee: Andrzej Bialecki 
 Fix For: 1.4

 Attachments: schema-solr4.xml


 The conf/schema.xml that we ship works ok for Solr 3.x, but in Solr trunk 
 some of the class names have been changed, and some field types have been 
 redefined, so if you simply drop this schema into Solr it will cause severe 
 errors and indexing won't work.
 I propose to add a version of the schema.xml file that is tailored to Solr 
 4.x so that users can deploy this schema when indexing to Solr trunk.





[jira] [Commented] (NUTCH-1135) Fix TestGoraStorage for Nutchgora

2011-10-14 Thread Andrzej Bialecki (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1135?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13127427#comment-13127427
 ] 

Andrzej Bialecki  commented on NUTCH-1135:
--

A few comments from the author of this monstrosity ;) First, thanks Ferdy for 
taking time to work with this, it's much appreciated, we need to move forward 
on this. I agree that ultimately this test should be moved to Gora and become a 
part of a larger test suite that verifies correctness of concurrent 
multi-threaded and multi-process operations.

However, the immediate purpose of this class was to stress-test the existing 
Gora versions in usage patterns typical for Nutch, in order to verify that a 
particular version of Gora is a viable storage layer for Nutch - so the test 
tries to replicate typical Nutch scenarios. Remember that this has to work not 
only for a toy crawl in a single JVM in local mode, but also for a fully 
distributed parallel map-reduce crawl. Consequently:

* testMultiThread: tests a scenario of multiple threads in a single JVM all 
writing to the same storage instance. This replicates a scenario present e.g. 
in a single Fetcher task. If this test fails (assuming it's properly 
constructed!) then this means that Gora will fail, perhaps silently (see 
NUTCH-893), in a fundamental Nutch tool.

* testMultiProcess: tests a scenario of multiple processes running in multiple 
JVMs all writing to the same storage instance. This replicates a scenario of 
multiple map-reduce tasks all using the same storage config (shared storage, 
e.g. HSQLDB in server mode), and it's fundamental to all Nutch tools running on 
a cluster. In map-reduce jobs there are usually many concurrent tasks, and some 
of them may execute in several copies in parallel (speculative execution) and 
some others may fail catastrophically without proper cleanup - and Gora 
backends must just deal with it. If this test fails (again, assuming it's 
properly constructed and doesn't exceed some OS capabilities of the test 
machine, or some known limits of a storage impl. like the number of concurrent 
connections) then it means that Gora storage is not reliable for a typical 
map-reduce usage, which sort of defeats the point of using it at all.

To summarize: I think the patch in its current form helps the tests pass, but I 
don't think it addresses the underlying problems in Gora (or perhaps the 
problems with HSQL backend), rather it hides the problem. After all, we want 
the test to mean something if it passes, to verify that we can use Gora for 
more than a toy crawl, with guarantees of correctness in presence of concurrent 
updates.

If the above errors don't indicate issues with Gora, but instead are caused by 
exceeded OS or hsql limits, or hsql misconfiguration, then of course we should 
fix the configs and adjust the numbers so that they make sense. But with the 
proper config and proper numbers both tests should pass, otherwise we can't be 
sure that Gora is working properly at all.
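The testMultiThread scenario boils down to the pattern below: many threads writing concurrently to one shared storage instance, then verifying that no writes were lost. This is a generic sketch with a ConcurrentHashMap standing in for the shared Gora DataStore; the real test of course goes through the store API and flushes/closes it.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class MultiThreadStoreTest {
    // Stand-in for a shared DataStore instance.
    static final Map<String, String> store = new ConcurrentHashMap<>();

    // Each "fetcher thread" writes a disjoint slice of keys to the same
    // store; afterwards every key must be present exactly once.
    static boolean run(int threads, int rowsPerThread) {
        store.clear();
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        for (int t = 0; t < threads; t++) {
            final int id = t;
            pool.execute(() -> {
                for (int i = 0; i < rowsPerThread; i++) {
                    store.put("http://example.com/" + id + "/" + i, "row");
                }
            });
        }
        pool.shutdown();
        try {
            pool.awaitTermination(30, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return store.size() == threads * rowsPerThread;  // no lost writes
    }

    public static void main(String[] args) {
        System.out.println(run(8, 1000));  // true if no writes were lost
    }
}
```

testMultiProcess follows the same verification idea, but with the writers spread across JVMs against a shared backend, which is exactly where speculative execution and failed tasks come into play.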

 Fix TestGoraStorage for Nutchgora
 -

 Key: NUTCH-1135
 URL: https://issues.apache.org/jira/browse/NUTCH-1135
 Project: Nutch
  Issue Type: Sub-task
  Components: storage
Affects Versions: nutchgora
Reporter: Lewis John McGibbney
Assignee: Lewis John McGibbney
Priority: Critical
 Fix For: nutchgora

 Attachments: NUTCH-1135-v1.patch, NUTCH-1135-v2.patch


 This issue is part of a larger target which aims to fix broken JUnit tests 
 for Nutchgora





[jira] [Commented] (NUTCH-1135) Fix TestGoraStorage for Nutchgora

2011-10-14 Thread Andrzej Bialecki (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1135?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13127470#comment-13127470
 ] 

Andrzej Bialecki  commented on NUTCH-1135:
--

bq. if you prefer to keep the old TestGoraStorage structure 

Not really, I'm not against cleanup / breaking it up - if it makes sense let's 
go for it. My main concern was that by skipping the multi-process test 
altogether we would ignore testing a part of Gora functionality that is 
critical to Nutch (well, to any other map-reduce app, too, but we're doing 
Nutch here ;) ). 

Thank you for your persistence.

bq. By the way, I tested the testMultithreaded with a DataStore that is not 
thread safe

Excellent!

 Fix TestGoraStorage for Nutchgora
 -

 Key: NUTCH-1135
 URL: https://issues.apache.org/jira/browse/NUTCH-1135
 Project: Nutch
  Issue Type: Sub-task
  Components: storage
Affects Versions: nutchgora
Reporter: Lewis John McGibbney
Assignee: Lewis John McGibbney
Priority: Critical
 Fix For: nutchgora

 Attachments: NUTCH-1135-v1.patch, NUTCH-1135-v2.patch


 This issue is part of a larger target which aims to fix broken JUnit tests 
 for Nutchgora





[jira] [Commented] (NUTCH-797) parse-tika is not properly constructing URLs when the target begins with a ?

2011-10-12 Thread Andrzej Bialecki (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-797?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13125712#comment-13125712
 ] 

Andrzej Bialecki  commented on NUTCH-797:
-

That's unexpected :) I checked the patch and I can't see where the bug could be 
... Did you make sure that your config is correct, and that the job actually 
sees the right value of this property in the config (check the job.xml via 
JobTracker)? TestDOMContentUtils indicates that it should work, so we need to 
make sure that the flag has correct value.

 parse-tika is not properly constructing URLs when the target begins with a ?
 --

 Key: NUTCH-797
 URL: https://issues.apache.org/jira/browse/NUTCH-797
 Project: Nutch
  Issue Type: Bug
  Components: parser
Affects Versions: 1.1
 Environment: Win 7, Java(TM) SE Runtime Environment (build 
 1.6.0_16-b01)
 Also repro's on RHEL and java 1.4.2
Reporter: Robert Hohman
Assignee: Andrzej Bialecki 
Priority: Minor
 Fix For: nutchgora

 Attachments: NUTCH-797.patch, pureQueryUrl-2.patch, pureQueryUrl.patch


 This is my first bug and patch on nutch, so apologies if I have not provided 
 enough detail.
 In crawling the page at 
 http://careers3.accenture.com/Careers/ASPX/Search.aspx?co=0&sk=0 there are 
 links in the page that look like this:
 <a href="?co=0&sk=0&p=2&pi=1">2</a></td><td><a 
 href="?co=0&sk=0&p=3&pi=1">3</a>
 in org.apache.nutch.parse.tika.DOMContentUtils rev 916362 (trunk), as 
 getOutlinks looks for links, it comes across this link, and constructs a new 
 url with a base URL class built from 
 "http://careers3.accenture.com/Careers/ASPX/Search.aspx?co=0&sk=0", and a 
 target of "?co=0&sk=0&p=2&pi=1"
 The URL class, per RFC 3986 at 
 http://labs.apache.org/webarch/uri/rfc/rfc3986.html#relative-merge, defines 
 how to merge these two, and per the RFC, the URL class merges these to: 
 http://careers3.accenture.com/Careers/ASPX/?co=0&sk=0&p=2&pi=1
 because the RFC explicitly states that the rightmost url segment (the 
 Search.aspx in this case) should be ripped off before combining.
 While this is compliant with the RFC, it means the URLs which are created for 
 the next round of fetching are incorrect.  Modern browsers seem to handle 
 this case (I checked IE8 and Firefox 3.5), so I'm guessing this is an obscure 
 exception or handling of what is a poorly formed url on accenture's part.
 I have fixed this by modifying DOMContentUtils to look for the case where a ? 
 begins the target, and then pulling the rightmost component out of the base 
 and inserting it into the target before the ?, so the target in this example 
 becomes:
 Search.aspx?co=0&sk=0&p=2&pi=1
 The URL class then properly constructs the new url as:
 http://careers3.accenture.com/Careers/ASPX/Search.aspx?co=0&sk=0&p=2&pi=1
 If it is agreed that this solution works, I believe the other html parsers in 
 nutch would need to be modified in a similar way.
 Can I get feedback on this proposed solution?  Specifically I'm worried about 
 unforeseen side effects.
 Much thanks
 Here is the patch info:
 Index: 
 src/plugin/parse-tika/src/java/org/apache/nutch/parse/tika/DOMContentUtils.java
 ===
 --- 
 src/plugin/parse-tika/src/java/org/apache/nutch/parse/tika/DOMContentUtils.java
(revision 916362)
 +++ 
 src/plugin/parse-tika/src/java/org/apache/nutch/parse/tika/DOMContentUtils.java
(working copy)
 @@ -299,6 +299,50 @@
  return false;
}

 +  private URL fixURL(URL base, String target) throws MalformedURLException
 +  {
 +   // handle params that are embedded into the base url - move them to target
 +   // so URL class constructs the new url class properly
 +   if (base.toString().indexOf(';') > 0)
 +  return fixEmbeddedParams(base, target);
 +
 +   // handle the case that there is a target that is a pure query.
 +   // Strictly speaking this is a violation of RFC 2396 section 5.2.2 on how to assemble
 +   // URLs but I've seen this in numerous places, for example at
 +   // http://careers3.accenture.com/Careers/ASPX/Search.aspx?co=0&sk=0
 +   // It has urls in the page of the form href="?co=0&sk=0&pg=1", and by default
 +   // URL constructs the base+target combo as
 +   // http://careers3.accenture.com/Careers/ASPX/?co=0&sk=0&pg=1, incorrectly
 +   // dropping the Search.aspx target
 +   //
 +   // Browsers handle these just fine, they must have an exception similar to this
 +   if (target.startsWith("?"))
 +   {
 +   return fixPureQueryTargets(base, target);
 +   }
 +
 +   return new URL(base, target);
 +  }
 +
 +  private URL fixPureQueryTargets(URL base, String target

Re: [jira] [Commented] (NUTCH-797) parse-tika is not properly constructing URLs when the target begins with a ?

2011-10-12 Thread Andrzej Bialecki

On 12/10/2011 13:17, Markus Jelsma (Commented) (JIRA) wrote:


 [ 
https://issues.apache.org/jira/browse/NUTCH-797?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13125717#comment-13125717
 ]

Markus Jelsma commented on NUTCH-797:
-

This test was on a local instance. I tried both values for 
parser.fix.embeddedparams with:
$ bin/nutch parsechecker http://www.funkybabes.nl/;ROOOWAN/fotoboek


Is this how it should be implemented? I'm not sure. Embedded params are a bit 
puzzling :)


Hmm ... if that's the exact command-line expression that you entered 
then if you are using a *nix shell the semicolon would mean the end of 
command, so in fact what was executed would be:


$ bin/nutch parsechecker http://www.funkybabes.nl/
...lots of output ...
bash: ROOOWAN/fotoboek: command not found


--
Best regards,
Andrzej Bialecki 
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



[jira] [Commented] (NUTCH-1097) application/xhtml+xml should be enabled in plugin.xml of parse-html; allow multiple mimetypes for plugin.xml

2011-10-12 Thread Andrzej Bialecki (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1097?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13125916#comment-13125916
 ] 

Andrzej Bialecki  commented on NUTCH-1097:
--

+1, the latest patch looks good.

 application/xhtml+xml should be enabled in plugin.xml of parse-html; allow 
 multiple mimetypes for plugin.xml
 

 Key: NUTCH-1097
 URL: https://issues.apache.org/jira/browse/NUTCH-1097
 Project: Nutch
  Issue Type: Bug
  Components: parser
Affects Versions: 1.3
Reporter: Ferdy
Priority: Minor
 Fix For: 1.4

 Attachments: NUTCH-1097-nutchgora_v1.patch, 
 NUTCH-1097-nutchgora_v2.patch, NUTCH-1097-v1.patch, NUTCH-1097-v2.patch, 
 NUTCH-1097-v3.patch, NUTCH-1097-v4.patch


 The configuration in parse-plugins.xml expects the parse-html plugin to 
 accept application/xhtml+xml, however the plugin.xml of this plugin does not 
 list this type. Either change the entry in parse-plugins.xml or change the 
 parse-html plugin.xml. I suggest the latter. See patch.





[jira] [Commented] (NUTCH-1142) Normalization and filtering in WebGraph

2011-10-12 Thread Andrzej Bialecki (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1142?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13125931#comment-13125931
 ] 

Andrzej Bialecki  commented on NUTCH-1142:
--

+1, the patch looks good.

(There is one philosophical :) aspect of this change, as with any situation 
where you calculate PageRank in presence of URL filtering: does it matter that 
a page was linked to from other pages that you decided to filter out? I.e. in 
Pagerank the relative page importance is a function of in-degree, and by 
filtering out incoming links you change the in-degree. This essentially means 
that you decide to ignore some evidence of a page being possibly more 
important, due to links from pages that may not be interesting to you but which 
still do exist. OTOH the incoming links may have been spam, so one would expect 
that in the grand picture it evens out.)
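The in-degree point can be made concrete with a toy sketch: filtering out a link's source page changes the score input for the target even when the target itself survives the filter. Plain Java, with illustrative names; this is not WebGraph code.

```java
import java.util.List;
import java.util.Map;
import java.util.function.Predicate;
import java.util.stream.Collectors;

public class InDegreeSketch {
    record Link(String from, String to) {}

    // In-degree per target url, counting only links whose source url
    // passes the filter.
    static Map<String, Long> inDegree(List<Link> links, Predicate<String> keep) {
        return links.stream()
                    .filter(l -> keep.test(l.from()))
                    .collect(Collectors.groupingBy(Link::to, Collectors.counting()));
    }

    public static void main(String[] args) {
        List<Link> links = List.of(
            new Link("http://good.example/a", "http://target.example/"),
            new Link("http://spam.example/b", "http://target.example/"));
        // Unfiltered: in-degree 2; with spam.example filtered out: 1,
        // even though the target url itself was never filtered.
        System.out.println(inDegree(links, u -> true));
        System.out.println(inDegree(links, u -> !u.contains("spam")));
    }
}
```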

 Normalization and filtering in WebGraph
 ---

 Key: NUTCH-1142
 URL: https://issues.apache.org/jira/browse/NUTCH-1142
 Project: Nutch
  Issue Type: Improvement
Reporter: Markus Jelsma
Assignee: Markus Jelsma
 Fix For: 1.5

 Attachments: NUTCH-1142-1.4.patch, NUTCH-1142-1.5-2.patch, 
 NUTCH-1142-1.5-3.patch


 The WebGraph programs performs URL normalization. Since normalization of 
 outlinks is already performed during the parse it should become optional. 
 There is also no URL filtering mechanism in the web graph program. When a 
 CrawlDatum is removed from the CrawlDB by an URL filter is should be possible 
 to remove it from the web graph as well.





[jira] [Commented] (NUTCH-797) parse-tika is not properly constructing URLs when the target begins with a ?

2011-10-11 Thread Andrzej Bialecki (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-797?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13124737#comment-13124737
 ] 

Andrzej Bialecki  commented on NUTCH-797:
-

The fixup code in Tika is still a private method in HtmlParser, so in this case 
the upgrade to Tika 0.10 won't help, we still have to apply the above patch.

I'll commit this shortly.

 parse-tika is not properly constructing URLs when the target begins with a ?
 --

 Key: NUTCH-797
 URL: https://issues.apache.org/jira/browse/NUTCH-797
 Project: Nutch
  Issue Type: Bug
  Components: parser
Affects Versions: 1.1
 Environment: Win 7, Java(TM) SE Runtime Environment (build 
 1.6.0_16-b01)
 Also repro's on RHEL and java 1.4.2
Reporter: Robert Hohman
Assignee: Andrzej Bialecki 
Priority: Minor
 Fix For: 1.4, nutchgora

 Attachments: pureQueryUrl-2.patch, pureQueryUrl.patch



[jira] [Commented] (NUTCH-797) parse-tika is not properly constructing URLs when the target begins with a ?

2011-10-11 Thread Andrzej Bialecki (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-797?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13125016#comment-13125016
 ] 

Andrzej Bialecki  commented on NUTCH-797:
-

Uhh, sorry - I'll fix this in a moment.

 parse-tika is not properly constructing URLs when the target begins with a ?
 --

 Key: NUTCH-797
 URL: https://issues.apache.org/jira/browse/NUTCH-797
 Project: Nutch
  Issue Type: Bug
  Components: parser
Affects Versions: 1.1
 Environment: Win 7, Java(TM) SE Runtime Environment (build 
 1.6.0_16-b01)
 Also repro's on RHEL and java 1.4.2
Reporter: Robert Hohman
Assignee: Andrzej Bialecki 
Priority: Minor
 Fix For: nutchgora

 Attachments: pureQueryUrl-2.patch, pureQueryUrl.patch



[jira] [Commented] (NUTCH-797) parse-tika is not properly constructing URLs when the target begins with a ?

2011-10-11 Thread Andrzej Bialecki (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-797?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13125077#comment-13125077
 ] 

Andrzej Bialecki  commented on NUTCH-797:
-

I'm puzzled by the algorithm in fixEmbeddedParams (which was refactored into 
URLUtil), and I don't understand how it was ever supposed to work. If I enable 
this method then most of the test URLs in TestURLUtil fail, because they are 
not resolved according to the RFC.

In your example in NUTCH-1115, what was the expected result of resolving the 
base url http://www.funkybabes.nl/;ROOOWAN/fotoboek; and e.g. a target of 
forumregels ?

* http://www.funkybabes.nl/forumregels
* http://www.funkybabes.nl/;ROOOWAN/forumregels
* http://www.funkybabes.nl/forumregels;ROOOWAN
* none of the above ;)

 parse-tika is not properly constructing URLs when the target begins with a ?
 --

 Key: NUTCH-797
 URL: https://issues.apache.org/jira/browse/NUTCH-797
 Project: Nutch
  Issue Type: Bug
  Components: parser
Affects Versions: 1.1
 Environment: Win 7, Java(TM) SE Runtime Environment (build 
 1.6.0_16-b01)
 Also repro's on RHEL and java 1.4.2
Reporter: Robert Hohman
Assignee: Andrzej Bialecki 
Priority: Minor
 Fix For: nutchgora

 Attachments: pureQueryUrl-2.patch, pureQueryUrl.patch


 This is my first bug and patch on nutch, so apologies if I have not provided 
 enough detail.
 In crawling the page at 
 http://careers3.accenture.com/Careers/ASPX/Search.aspx?co=0&sk=0 there are 
 links in the page that look like this:
 <a href="?co=0&sk=0&p=2&pi=1">2</a></td><td><a href="?co=0&sk=0&p=3&pi=1">3</a>
 in org.apache.nutch.parse.tika.DOMContentUtils rev 916362 (trunk), as 
 getOutlinks looks for links, it comes across this link and constructs a new 
 url with a base URL class built from 
 "http://careers3.accenture.com/Careers/ASPX/Search.aspx?co=0&sk=0", and a 
 target of "?co=0&sk=0&p=2&pi=1".
 The URL class, per RFC 3986 at 
 http://labs.apache.org/webarch/uri/rfc/rfc3986.html#relative-merge, defines 
 how to merge these two, and per the RFC, the URL class merges them to 
 http://careers3.accenture.com/Careers/ASPX/?co=0&sk=0&p=2&pi=1
 because the RFC explicitly states that the rightmost url segment (the 
 Search.aspx in this case) should be ripped off before combining.
 While this is compliant with the RFC, it means the URLs which are created for 
 the next round of fetching are incorrect.  Modern browsers seem to handle 
 this case (I checked IE8 and Firefox 3.5), so I'm guessing this is an obscure 
 exception or handling of what is a poorly formed url on accenture's part.
 I have fixed this by modifying DOMContentUtils to look for the case where a ? 
 begins the target, and then pulling the rightmost component out of the base 
 and inserting it into the target before the ?, so the target in this example 
 becomes:
 Search.aspx?co=0&sk=0&p=2&pi=1
 The URL class then properly constructs the new url as:
 http://careers3.accenture.com/Careers/ASPX/Search.aspx?co=0&sk=0&p=2&pi=1
 If it is agreed that this solution works, I believe the other html parsers in 
 nutch would need to be modified in a similar way.
 Can I get feedback on this proposed solution?  Specifically I'm worried about 
 unforeseen side effects.
 Much thanks
 Here is the patch info:
 Index: src/plugin/parse-tika/src/java/org/apache/nutch/parse/tika/DOMContentUtils.java
 ===================================================================
 --- src/plugin/parse-tika/src/java/org/apache/nutch/parse/tika/DOMContentUtils.java (revision 916362)
 +++ src/plugin/parse-tika/src/java/org/apache/nutch/parse/tika/DOMContentUtils.java (working copy)
 @@ -299,6 +299,50 @@
    return false;
  }

 +  private URL fixURL(URL base, String target) throws MalformedURLException
 +  {
 +   // handle params that are embedded into the base url - move them to target
 +   // so URL class constructs the new url class properly
 +   if (base.toString().indexOf(';') > 0)
 +  return fixEmbeddedParams(base, target);
 +
 +   // handle the case that there is a target that is a pure query.
 +   // Strictly speaking this is a violation of RFC 2396 section 5.2.2 on how to assemble
 +   // URLs but I've seen this in numerous places, for example at
 +   // http://careers3.accenture.com/Careers/ASPX/Search.aspx?co=0&sk=0
 +   // It has urls in the page of the form href="?co=0&sk=0&pg=1", and by default
 +   // URL constructs the base+target combo as
 +   // http://careers3.accenture.com/Careers/ASPX/?co=0&sk=0&pg=1, incorrectly
 +   // dropping the Search.aspx target
 +   //
 +   // Browsers handle these just fine, they must have an exception similar

[jira] [Updated] (NUTCH-797) parse-tika is not properly constructing URLs when the target begins with a ?

2011-10-11 Thread Andrzej Bialecki (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-797?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrzej Bialecki  updated NUTCH-797:


Attachment: NUTCH-797.patch

Tentative patch, which changes the meaning of fixEmbeddedParams to 
removeEmbeddedParams.

 parse-tika is not properly constructing URLs when the target begins with a ?
 --

 Key: NUTCH-797
 URL: https://issues.apache.org/jira/browse/NUTCH-797
 Project: Nutch
  Issue Type: Bug
  Components: parser
Affects Versions: 1.1
 Environment: Win 7, Java(TM) SE Runtime Environment (build 
 1.6.0_16-b01)
 Also repro's on RHEL and java 1.4.2
Reporter: Robert Hohman
Assignee: Andrzej Bialecki 
Priority: Minor
 Fix For: nutchgora

 Attachments: NUTCH-797.patch, pureQueryUrl-2.patch, pureQueryUrl.patch



[jira] [Commented] (NUTCH-1097) application/xhtml+xml should be enabled in plugin.xml of parse-html; allow multiple mimetypes for plugin.xml

2011-10-11 Thread Andrzej Bialecki (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1097?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13125414#comment-13125414
 ] 

Andrzej Bialecki  commented on NUTCH-1097:
--

+1, the idea makes sense. The patch looks good, but it needs a minor fix - mime 
types may also contain "." characters, e.g. application/vnd.ms-word, and 
these need to be escaped too.
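A small illustration of why an unescaped "." matters when a mime type is used inside a regex (Pattern.quote shown as one way to escape the literal; this is not the actual patch code):

```java
import java.util.regex.Pattern;

public class MimeTypeEscape {
    public static void main(String[] args) {
        String mime = "application/vnd.ms-word";
        // Used naively as a regex, '.' matches any character, so the
        // pattern is too loose:
        System.out.println("application/vndXms-word".matches(mime)); // true (wrong)
        // Quoting the mime type as a literal closes the hole:
        System.out.println("application/vndXms-word".matches(Pattern.quote(mime))); // false
        System.out.println(mime.matches(Pattern.quote(mime))); // true
    }
}
```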

 application/xhtml+xml should be enabled in plugin.xml of parse-html; allow 
 multiple mimetypes for plugin.xml
 

 Key: NUTCH-1097
 URL: https://issues.apache.org/jira/browse/NUTCH-1097
 Project: Nutch
  Issue Type: Bug
  Components: parser
Affects Versions: 1.3
Reporter: Ferdy
Priority: Minor
 Fix For: 1.4

 Attachments: NUTCH-1097-nutchgora_v1.patch, NUTCH-1097-v1.patch, 
 NUTCH-1097-v2.patch, NUTCH-1097-v3.patch


 The configuration in parse-plugins.xml expects the parse-html plugin to 
 accept application/xhtml+xml, however the plugin.xml of this plugin does not 
 list this type. Either change the entry in parse-plugins.xml or change the 
 parse-html plugin.xml. I suggest the latter. See patch.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (NUTCH-1154) Upgrade to Tika 0.10

2011-10-10 Thread Andrzej Bialecki (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1154?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13124428#comment-13124428
 ] 

Andrzej Bialecki  commented on NUTCH-1154:
--

TIKA-748 has been fixed and is scheduled to be included in Tika 1.0. If there 
are no objections I'd like to commit Tika 0.10, put a comment in CHANGES.txt, 
and disable this part of the test until we upgrade to Tika 1.0.

 Upgrade to Tika 0.10
 

 Key: NUTCH-1154
 URL: https://issues.apache.org/jira/browse/NUTCH-1154
 Project: Nutch
  Issue Type: Improvement
  Components: parser
Affects Versions: 1.4
Reporter: Andrzej Bialecki 
 Attachments: NUTCH-1154.diff


 There have been significant improvements in Tika 0.10 and it would be nice to 
 use the latest Tika in 1.4.





[jira] [Created] (NUTCH-1152) Upgrade to SolrJ 3.4.0

2011-10-07 Thread Andrzej Bialecki (Created) (JIRA)
Upgrade to SolrJ 3.4.0
--

 Key: NUTCH-1152
 URL: https://issues.apache.org/jira/browse/NUTCH-1152
 Project: Nutch
  Issue Type: Improvement
Reporter: Andrzej Bialecki 
 Fix For: 1.4


Current release of Lucene/Solr is 3.4.0, but we're still using 3.1.0. The fix 
is trivial, just replace 3.1.0 with 3.4.0 in ivy.xml. If there are no 
objections I'll make the change shortly.
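The change amounts to bumping the revision in ivy.xml; an illustrative fragment (the exact attributes and conf mapping in Nutch's ivy.xml may differ):

```xml
<!-- illustrative ivy.xml dependency line; conf mapping may differ in Nutch -->
<dependency org="org.apache.solr" name="solr-solrj" rev="3.4.0" conf="*->default"/>
```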





[jira] [Resolved] (NUTCH-1152) Upgrade to SolrJ 3.4.0

2011-10-07 Thread Andrzej Bialecki (Resolved) (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1152?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrzej Bialecki  resolved NUTCH-1152.
--

Resolution: Fixed
  Assignee: Andrzej Bialecki 

Committed in rev. 1180087. This commit also upgrades SLF4J as a dependency of 
SolrJ, to release 1.6.1.

 Upgrade to SolrJ 3.4.0
 --

 Key: NUTCH-1152
 URL: https://issues.apache.org/jira/browse/NUTCH-1152
 Project: Nutch
  Issue Type: Improvement
Reporter: Andrzej Bialecki 
Assignee: Andrzej Bialecki 
 Fix For: 1.4


 Current release of Lucene/Solr is 3.4.0, but we're still using 3.1.0. The fix 
 is trivial, just replace 3.1.0 with 3.4.0 in ivy.xml. If there are no 
 objections I'll make the change shortly.





[jira] [Created] (NUTCH-1154) Upgrade to Tika 0.10

2011-10-07 Thread Andrzej Bialecki (Created) (JIRA)
Upgrade to Tika 0.10


 Key: NUTCH-1154
 URL: https://issues.apache.org/jira/browse/NUTCH-1154
 Project: Nutch
  Issue Type: Improvement
  Components: parser
Affects Versions: 1.4
Reporter: Andrzej Bialecki 


There have been significant improvements in Tika 0.10 and it would be nice to 
use the latest Tika in 1.4.





[jira] [Updated] (NUTCH-1154) Upgrade to Tika 0.10

2011-10-07 Thread Andrzej Bialecki (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1154?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrzej Bialecki  updated NUTCH-1154:
-

Attachment: NUTCH-1154.diff

Patch to upgrade to Tika 0.10. Unfortunately, TestRTFParser fails with this 
version of Tika - the extracted body of the text is empty. See TIKA-748. Still, 
I think the improvements in PDF and Office parsers are worth the upgrade.

 Upgrade to Tika 0.10
 

 Key: NUTCH-1154
 URL: https://issues.apache.org/jira/browse/NUTCH-1154
 Project: Nutch
  Issue Type: Improvement
  Components: parser
Affects Versions: 1.4
Reporter: Andrzej Bialecki 
 Attachments: NUTCH-1154.diff


 There have been significant improvements in Tika 0.10 and it would be nice to 
 use the latest Tika in 1.4.





[jira] [Commented] (NUTCH-1124) JUnit test for scoring-opic

2011-10-05 Thread Andrzej Bialecki (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1124?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13120982#comment-13120982
 ] 

Andrzej Bialecki  commented on NUTCH-1124:
--

Our implementation is most definitely inaccurate (broken?), though I'm not sure 
if the original OPIC algorithm is better.

The original OPIC paper explains that each node needs to give away all its 
cash, and then receive cash from other nodes, but in their experiments this led 
to a yo-yo instability of large amounts of cash floating in and out, in 
response to changes in the graph and the fact that there is a delay of a full 
re-crawl cycle, i.e. all known urls need to be re-crawled in order to collect 
and redistribute all cash that is potentially floating in the graph. In order 
to dampen this effect they added buffering - a history of the latest N scores, 
and they would consider an average of these scores. This resulted in smoothing 
and dampening of changes, but it's an artificial hack that is sensitive to the 
dynamics of changes in the webgraph and the speed of re-crawl.

Our implementation of OPIC doesn't give away cash at all, instead it duplicates 
it and then distributes, which causes the total amount of cash floating in a 
webgraph to double in each cycle even when a graph is static. We could fix this 
by giving away all cash and then introducing a mechanism to collect all cash 
from dangling nodes (without outlinks) to redistribute it evenly to all nodes. 
This would bring us closer to the original OPIC without smoothing. Still, I 
expect the same instability would occur, especially in the face of a changing 
graph.
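A toy sketch of the difference (not Nutch code): on a static 3-node ring, the duplicate-and-distribute step described above doubles the total cash each cycle, while the give-away step conserves it.

```java
import java.util.Arrays;

public class OpicCashSketch {
    // Toy 3-node ring graph: node i links only to node i+1.
    // Nutch-style step: a node keeps its cash AND sends a full copy to its
    // outlink, so total cash doubles every cycle.
    static double[] nutchStep(double[] cash) {
        double[] next = cash.clone();                 // nodes keep their cash
        for (int i = 0; i < cash.length; i++)
            next[(i + 1) % cash.length] += cash[i];   // ...and give a copy away
        return next;
    }

    // Original-OPIC-style step: a node gives away ALL its cash; total is conserved.
    static double[] opicStep(double[] cash) {
        double[] next = new double[cash.length];
        for (int i = 0; i < cash.length; i++)
            next[(i + 1) % cash.length] += cash[i];
        return next;
    }

    static double total(double[] c) { return Arrays.stream(c).sum(); }

    public static void main(String[] args) {
        double[] a = {1, 1, 1}, b = {1, 1, 1};
        for (int cycle = 0; cycle < 3; cycle++) { a = nutchStep(a); b = opicStep(b); }
        System.out.println(total(a)); // 24.0 - doubled every cycle
        System.out.println(total(b)); // 3.0  - conserved
    }
}
```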

 JUnit test for scoring-opic
 ---

 Key: NUTCH-1124
 URL: https://issues.apache.org/jira/browse/NUTCH-1124
 Project: Nutch
  Issue Type: Sub-task
  Components: build
Affects Versions: 1.4
Reporter: Lewis John McGibbney
Priority: Minor
 Fix For: 1.5


 This issue is part of the larger attempt to provide a Junit test case for 
 every Nutch plugin.





Re: [VOTE] Move 2.0 out of trunk

2011-09-20 Thread Andrzej Bialecki

On 18/09/2011 02:21, Julien Nioche wrote:

Hi,

Following the discussions [1] on the dev-list about the future of Nutch
2.0, I would like to call for a vote on moving Nutch 2.0 from the trunk
to a separate branch, promote 1.4 to trunk and consider 2.0 as
unmaintained. The arguments for / against can be found in the thread I
mentioned.

The vote is open for the next 72 hours.

[ ] +1 : Shelve 2.0 and move 1.4 to trunk
[ ]  0 : No opinion
[ ] -1 : Bad idea.  Please give justification.


+1 - at this time it's clear that 2.0 didn't pan out as we expected; we 
should restart from the 1.x codebase for a usable platform, and continue the 
redesign from there.


--
Best regards,
Andrzej Bialecki 
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



[jira] [Commented] (NUTCH-1087) Deprecate crawl command and replace with example script

2011-08-23 Thread Andrzej Bialecki (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1087?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13089405#comment-13089405
 ] 

Andrzej Bialecki  commented on NUTCH-1087:
--


IIRC we had this discussion in the past... It's true that we already rely on 
Bash to do anything useful, no matter whether it's on Windows or on a *nix-like 
OS. And it's true that the crawl command has been a constant source of 
confusion over the years. The crawl application also suffered from some subtle 
bugs, especially when running in local mode (e.g. the PluginRepository leaks).

But the argument about maintenance costs is IMHO moot - you have to maintain a 
shell script, too, so it's no different from maintaining a Java class. Where it 
differs, I think, is that moving the crawl cycle logic to a shell script now 
raises the bar for Java developers who are not familiar with Bash scripting - a 
robust crawl script is not easy to follow, as it needs to handle error 
conditions and manage input/output resources on HDFS. On the other hand it's 
easier for system admins to tweak a script than to tweak Java code... so I 
guess it's also a question of who the audience for this functionality is.

I'm +0 for removing Crawl and replacing it with a script, IMHO it doesn't 
change the picture in any significant way.


 Deprecate crawl command and replace with example script
 ---

 Key: NUTCH-1087
 URL: https://issues.apache.org/jira/browse/NUTCH-1087
 Project: Nutch
  Issue Type: Task
Affects Versions: 1.4
Reporter: Markus Jelsma
Priority: Minor
 Fix For: 1.4


 * remove the crawl command
 * add basic crawl shell script
 See thread:
 http://www.mail-archive.com/dev@nutch.apache.org/msg03848.html





[jira] [Commented] (NUTCH-1014) Migrate from Apache ORO to java.util.regex

2011-07-19 Thread Andrzej Bialecki (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1014?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13067972#comment-13067972
 ] 

Andrzej Bialecki  commented on NUTCH-1014:
--

java.util.regex has the advantage of being a part of the JRE. However, it is 
quite slow for more complex regexes. See e.g. this benchmark: 
http://www.tusker.org/regex/regex_benchmark.html . In my experience with larger 
crawls this is especially important when using regexes for URL filtering and 
normalization - an innocent-looking regex can melt the cpu when processing a 
64kB long junk URL, and consequently it can stall the crawl... In such cases 
it's good to have an option to fall back to a subset of regex features and use 
a DFA-based library like e.g. Brics. ORO is generally faster than j.u.regex 
(but also it isn't maintained anymore). Brics lacks support for many operators, 
but it's fast. Perhaps ICU4j would be a good alternative - it's fully 
JDK-compatible and offers good performance.
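A quick demonstration of the backtracking blow-up with java.util.regex (the input is deliberately short here; lengthen the run of 'a's and the time grows exponentially, which is the "melt the cpu" failure mode on long junk URLs):

```java
import java.util.regex.Pattern;

public class RegexBacktracking {
    public static void main(String[] args) {
        // Nested quantifiers force exponential backtracking on a near-miss
        // input (no terminating 'b'). 20 chars is ~2^20 states and still
        // completes; a 64kB junk URL against a similar pattern can stall
        // a fetcher thread.
        Pattern p = Pattern.compile("(a+)+b");
        String nearMiss = "aaaaaaaaaaaaaaaaaaaa" + "x"; // 20 'a's, then 'x'
        long t0 = System.nanoTime();
        boolean matched = p.matcher(nearMiss).matches();
        long micros = (System.nanoTime() - t0) / 1_000;
        System.out.println("matched=" + matched + " in ~" + micros + " us");
    }
}
```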

 Migrate from Apache ORO to java.util.regex
 --

 Key: NUTCH-1014
 URL: https://issues.apache.org/jira/browse/NUTCH-1014
 Project: Nutch
  Issue Type: Improvement
Reporter: Markus Jelsma
 Fix For: 1.4, 2.0


 A separate issue tracking migration of all components from Apache ORO to 
 java.util.regex. Components involved are:
 - RegexURLNormalzier
 - OutlinkExtractor
 - JSParseFilter
 - MoreIndexingFilter
 - BasicURLNormalizer





[jira] [Commented] (NUTCH-985) MoreIndexingFilter doesn't use properly formatted date fields for Solr

2011-05-17 Thread Andrzej Bialecki (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-985?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13034724#comment-13034724
 ] 

Andrzej Bialecki  commented on NUTCH-985:
-

We should use Solr's DateUtil in all such places, to avoid code duplication 
and confusion should the date format ever change... The patch does essentially 
the same as DateUtil, except that DateUtil reuses SimpleDateFormat instances 
in a thread-safe way, so it's more efficient.
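The reuse pattern in question, sketched (an illustration of thread-safe SimpleDateFormat reuse, not DateUtil's actual code; the format is Solr's ISO-8601 UTC form):

```java
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.TimeZone;

public class SolrDateFormat {
    // SimpleDateFormat is not thread-safe, so instead of creating a new
    // instance per call, keep one per thread and reuse it.
    private static final ThreadLocal<SimpleDateFormat> FMT =
        ThreadLocal.withInitial(() -> {
            SimpleDateFormat f = new SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ss'Z'");
            f.setTimeZone(TimeZone.getTimeZone("UTC"));
            return f;
        });

    public static String format(Date d) {
        return FMT.get().format(d);
    }

    public static void main(String[] args) {
        System.out.println(format(new Date(0L))); // 1970-01-01T00:00:00Z
    }
}
```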

 MoreIndexingFilter doesn't use properly formatted date fields for Solr
 --

 Key: NUTCH-985
 URL: https://issues.apache.org/jira/browse/NUTCH-985
 Project: Nutch
  Issue Type: Bug
  Components: indexer
Affects Versions: 1.3, 2.0
Reporter: Dietrich Schmidt
Assignee: Markus Jelsma
 Fix For: 1.3, 2.0

 Attachments: NUTCH-985-trunk-1.patch, NUTCH-985.1.3-1.patch, 
 indexlastmodifieddate.jar


 I am using the index-more plugin to parse the lastModified data in web
 pages in order to store it in a Solr data field.
 In solrindex-mapping.xml I am mapping lastModified to a field changed in 
 Solr:
 <field dest="changed" source="lastModified"/>
 However, when posting data to Solr the SolrIndexer posts it as a long,
 not as a date:
 <add><doc boost="1.0"><field
 name="changed">107932680</field><field
 name="tstamp">20110414144140188</field><field
 name="date">20040315</field>
 Solr rejects the data because of the improper data type.



Re: Differences 1.x and trunk

2011-03-18 Thread Andrzej Bialecki

On 3/18/11 4:31 PM, Markus Jelsma wrote:

Hi all,

I'm giving it a try to patch https://issues.apache.org/jira/browse/NUTCH-963
to trunk after committing to 1.3. There are of course a lot of differences so
I need a little advice on how to proceed:

- instead of using CrawlDB and CrawlDatum we now need WebTableReader?


Actually you need to use StorageUtils to set up Mapper or Reducer 
contexts. See other tools, e.g. Fetcher or Generator.



- trunk uses slf instead of commons logging now?


Yes.


- a page is now represented by storage.WebPage?


Yes. When you prepare a Job you also need to specify what fields from 
WebPage you are interested in (and only these fields will be pulled in 
from the storage). This is all handled by StorageUtils methods.


--
Best regards,
Andrzej Bialecki 
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: [jira] Closed: (NUTCH-951) Backport changes from 2.0 into 1.3

2011-03-10 Thread Andrzej Bialecki

On 3/10/11 10:57 PM, Julien Nioche (JIRA) wrote:


  [ 
https://issues.apache.org/jira/browse/NUTCH-951?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Nioche closed NUTCH-951.
---


NUTCH-825 committed in revision 1080368
All the known improvements from 2.0 have been backported into 1.3 now



The only remaining issue to address before rolling out a 1.3 release is 
NUTCH-914 Implement Apache Project Branding Requirements (and subtasks...)



--
Best regards,
Andrzej Bialecki 
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



[jira] Resolved: (NUTCH-951) Backport changes from 2.0 into 1.3

2011-03-09 Thread Andrzej Bialecki (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-951?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrzej Bialecki  resolved NUTCH-951.
-

Resolution: Fixed

 Backport changes from 2.0 into 1.3
 --

 Key: NUTCH-951
 URL: https://issues.apache.org/jira/browse/NUTCH-951
 Project: Nutch
  Issue Type: Task
Affects Versions: 1.3
Reporter: Julien Nioche
Assignee: Andrzej Bialecki 
Priority: Blocker
 Fix For: 1.3


 I've compared the changes from 2.0 with 1.3 and found the following 
 differences (excluding anything specific to 2.0/GORA)
 *  NUTCH-564 External parser supports encoding attribute (Antony 
 Bowesman, mattmann)
 *  NUTCH-714 Need a SFTP and SCP Protocol Handler (Sanjoy Ghosh, mattmann)
 *  NUTCH-825 Publish nutch artifacts to central maven repository 
 (mattmann)
 *  NUTCH-851 Port logging to slf4j (jnioche)
 *  NUTCH-861 Renamed HTMLParseFilter into ParseFilter
 *  NUTCH-872 Change the default fetcher.parse to FALSE (ab).
 *  NUTCH-876 Remove remaining robots/IP blocking code in lib-http (ab)
 *  NUTCH-880 REST API for Nutch (ab)
 *  NUTCH-883 Remove unused parameters from nutch-default.xml (jnioche)
 *  NUTCH-884 FetcherJob should run more reduce tasks than default (ab)
 *  NUTCH-886 A .gitignore file for Nutch (dogacan)
 *  NUTCH-894 Move statistical language identification from indexing to 
 parsing step
 *  NUTCH-921 Reduce dependency of Nutch on config files (ab)
 *  NUTCH-930 Remove remaining dependencies on Lucene API (ab)
 *  NUTCH-931 Simple admin API to fetch status and stop the service (ab)
 *  NUTCH-932 Bulk REST API to retrieve crawl results as JSON (ab)
 Let's go through this and decide what to port to 1.3



[jira] Commented: (NUTCH-951) Backport changes from 2.0 into 1.3

2011-03-09 Thread Andrzej Bialecki (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-951?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13004488#comment-13004488
 ] 

Andrzej Bialecki  commented on NUTCH-951:
-

* Ported NUTCH-872 in rev. 1079746.
* Ported NUTCH-876 in rev. 1079753.
* Ported NUTCH-921 in rev. 1079760.
* NUTCH-884 is not applicable to 1.3 because here fetching executes in map 
tasks, so there's a correct number of them already.

 Backport changes from 2.0 into 1.3
 --

 Key: NUTCH-951
 URL: https://issues.apache.org/jira/browse/NUTCH-951
 Project: Nutch
  Issue Type: Task
Affects Versions: 1.3
Reporter: Julien Nioche
Assignee: Andrzej Bialecki 
Priority: Blocker
 Fix For: 1.3


 I've compared the changes from 2.0 with 1.3 and found the following 
 differences (excluding anything specific to 2.0/GORA)
 *  NUTCH-564 External parser supports encoding attribute (Antony 
 Bowesman, mattmann)
 *  NUTCH-714 Need a SFTP and SCP Protocol Handler (Sanjoy Ghosh, mattmann)
 *  NUTCH-825 Publish nutch artifacts to central maven repository 
 (mattmann)
 *  NUTCH-851 Port logging to slf4j (jnioche)
 *  NUTCH-861 Renamed HTMLParseFilter into ParseFilter
 *  NUTCH-872 Change the default fetcher.parse to FALSE (ab).
 *  NUTCH-876 Remove remaining robots/IP blocking code in lib-http (ab)
 *  NUTCH-880 REST API for Nutch (ab)
 *  NUTCH-883 Remove unused parameters from nutch-default.xml (jnioche)
 *  NUTCH-884 FetcherJob should run more reduce tasks than default (ab)
 *  NUTCH-886 A .gitignore file for Nutch (dogacan)
 *  NUTCH-894 Move statistical language identification from indexing to 
 parsing step
 *  NUTCH-921 Reduce dependency of Nutch on config files (ab)
 *  NUTCH-930 Remove remaining dependencies on Lucene API (ab)
 *  NUTCH-931 Simple admin API to fetch status and stop the service (ab)
 *  NUTCH-932 Bulk REST API to retrieve crawl results as JSON (ab)
 Let's go through this and decide what to port to 1.3



[jira] Resolved: (NUTCH-962) max. redirects not handled correctly: fetcher stops at max-1 redirects

2011-03-09 Thread Andrzej Bialecki (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-962?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrzej Bialecki  resolved NUTCH-962.
-

   Resolution: Fixed
Fix Version/s: 2.0
   1.3
 Assignee: Andrzej Bialecki 

 max. redirects not handled correctly: fetcher stops at max-1 redirects
 --

 Key: NUTCH-962
 URL: https://issues.apache.org/jira/browse/NUTCH-962
 Project: Nutch
  Issue Type: Bug
  Components: fetcher
Affects Versions: 1.2, 1.3, 2.0
Reporter: Sebastian Nagel
Assignee: Andrzej Bialecki 
 Fix For: 1.3, 2.0

 Attachments: Fetcher_redir.patch


 The fetcher stops following redirects one redirect before the max. redirects 
 is reached.
 The description of http.redirect.max
  The maximum number of redirects the fetcher will follow when
  trying to fetch a page. If set to negative or 0, fetcher won't immediately
  follow redirected URLs, instead it will record them for later fetching.
 suggests that if set to 1 that one redirect will be followed.
 I tried to crawl two documents, the first redirecting by
  <meta http-equiv="refresh" content="0; URL=./to/meta_refresh_target.html">
 to the second, with http.redirect.max = 1.
 The second document is not fetched and the URL has state GONE in CrawlDb.
 fetching file:/test/redirects/meta_refresh.html
 redirectCount=0
 -finishing thread FetcherThread, activeThreads=1
  - content redirect to file:/test/redirects/to/meta_refresh_target.html 
 (fetching now)
  - redirect count exceeded file:/test/redirects/to/meta_refresh_target.html
 The attached patch would fix this: if http.redirect.max is 1 : one redirect 
 is followed.
 Of course, this would mean there is no possibility to skip redirects at all 
 since 0
 (as well as negative values) means treat redirects as ordinary links.
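The off-by-one comes down to the guard on the redirect counter; schematically (this is an illustration of the described behavior, not the actual Fetcher code):

```java
public class RedirectGuard {
    // Returns how many redirects in an (unbounded) chain get followed
    // before the guard fires, given http.redirect.max = max.
    static int followed(int max, boolean buggy) {
        int count = 0;
        while (true) {
            // the buggy guard trips one redirect too early
            if (buggy ? count >= max - 1 : count >= max) break;
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        System.out.println(followed(1, true));  // 0 - "redirect count exceeded" at once
        System.out.println(followed(1, false)); // 1 - one redirect followed, as documented
    }
}
```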



[jira] Resolved: (NUTCH-955) Ivy configuration

2011-03-09 Thread Andrzej Bialecki (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-955?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrzej Bialecki  resolved NUTCH-955.
-

   Resolution: Fixed
Fix Version/s: 2.0
 Assignee: Andrzej Bialecki 

 Ivy configuration
 -

 Key: NUTCH-955
 URL: https://issues.apache.org/jira/browse/NUTCH-955
 Project: Nutch
  Issue Type: Improvement
  Components: build
Affects Versions: 2.0
Reporter: Alexis
Assignee: Andrzej Bialecki 
 Fix For: 2.0

 Attachments: ivy.patch


 As mentioned in NUTCH-950, we can slightly improve the Ivy configuration to 
 help setup the Gora backend more easily.
 If the user does not want to stick with default HSQL database, other 
 alternatives exist, such as MySQL and HBase.
 org.restlet and xercesImpl versions should be changed as well.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira


Gora/HBase dependencies and deploy artifacts

2011-02-24 Thread Andrzej Bialecki

Hi all,

Recently I've been deploying Nutch trunk to an already existing Hadoop 
cluster. And immediately I hit a snag.


Nutch was configured to use gora-hbase. The nutch.job jar doesn't 
include gora-hbase even if it was configured in nutch-site.xml. 
Furthermore, gora-hbase depends on HBase and its dependencies, which 
need to be found on classpath.


Typically for development and testing I solved this issue by deploying 
gora-core and gora-hbase + all hbase libs to hadoop/lib across the 
cluster. This is a bit dirty - Hadoop clusters should be seen as a 
generic computing fabric, so they should be application-agnostic, 
besides, this creates maintenance & ops issues.


We could put all these libs in lib/ inside nutch.job, so that they are 
unpacked and put on classpath during task setup. This would work fine 
for Mapper/Reducer. HOWEVER... I saw in some versions of Hadoop that 
InputFormat / OutputFormat classes were initialized prior to this 
unpacking - and in our case these depend on the libs in as-yet-unpacked 
job jar... e.g. GoraInputFormat. (I'm not 100% sure that's the case in 
Hadoop 0.20.2, so this is something that needs to be tested).


Furthermore, even if we packed the jars in lib/ inside nutch.job, still 
many tools wouldn't work, because they depend on classes from those libs 
during the local execution (before the job is sent to task trackers), 
and the URLClassLoader can't load classes from jars within jars... A 
workaround for this would be to take all those jars and re-pack them 
together under / directory in nutch.job. This would satisfy the 
dependencies for local execution, and for Mapper/Reducer execution but 
I'm not sure if it solves the problem of Input/OutputFormat-s that I 
mentioned above.


In short, we need a clear working procedure how to deploy Gora backend 
implementations so that they work with Nutch and with a generic 
unmodified Hadoop cluster.


--
Best regards,
Andrzej Bialecki 
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



[jira] Resolved: (NUTCH-939) Added -dir command line option to Indexer and SolrIndexer, allowing to specify directory containing segments

2010-12-21 Thread Andrzej Bialecki (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-939?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrzej Bialecki  resolved NUTCH-939.
-

Resolution: Fixed
  Assignee: Andrzej Bialecki 

I modified the patch slightly to allow more flexibility (you can mix individual 
segment names and the -dir options) as well as allowing segments placed on 
different filesystems. Committed in rev. 1051505. Thank you!

 Added -dir command line option to Indexer and SolrIndexer,  allowing to 
 specify directory containing segments
 -

 Key: NUTCH-939
 URL: https://issues.apache.org/jira/browse/NUTCH-939
 Project: Nutch
  Issue Type: Improvement
  Components: indexer
Affects Versions: 1.3
Reporter: Claudio Martella
Assignee: Andrzej Bialecki 
Priority: Minor
 Fix For: 1.3

 Attachments: Indexer.patch, SolrIndexer.patch


 The patches add -dir option, so the user can specify the directory in which 
 the segments are to be found. The actual mode is to specify the list of 
 segments, which is not very easy with hdfs. Also, the -dir option is already 
 implemented in LinkDB and SegmentMerger, for example.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Created: (NUTCH-948) Remove Lucene dependencies

2010-12-21 Thread Andrzej Bialecki (JIRA)
Remove Lucene dependencies
--

 Key: NUTCH-948
 URL: https://issues.apache.org/jira/browse/NUTCH-948
 Project: Nutch
  Issue Type: Improvement
Affects Versions: 1.3
Reporter: Andrzej Bialecki 
Assignee: Andrzej Bialecki 
 Fix For: 1.3


Branch-1.3 still has Lucene libs, but uses Lucene only in one place, namely it 
uses DateTools in index-basic. DateTools should be replaced with Solr's 
DateUtil, as we did in trunk, and then we can remove Lucene libs as a 
dependency.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Resolved: (NUTCH-948) Remove Lucene dependencies

2010-12-21 Thread Andrzej Bialecki (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-948?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrzej Bialecki  resolved NUTCH-948.
-

Resolution: Fixed

Committed in rev. 1051509.

 Remove Lucene dependencies
 --

 Key: NUTCH-948
 URL: https://issues.apache.org/jira/browse/NUTCH-948
 Project: Nutch
  Issue Type: Improvement
Affects Versions: 1.3
Reporter: Andrzej Bialecki 
Assignee: Andrzej Bialecki 
 Fix For: 1.3


 Branch-1.3 still has Lucene libs, but uses Lucene only in one place, namely 
 it uses DateTools in index-basic. DateTools should be replaced with Solr's 
 DateUtil, as we did in trunk, and then we can remove Lucene libs as a 
 dependency.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (NUTCH-939) Added -dir command line option to Indexer and SolrIndexer, allowing to specify directory containing segments

2010-12-21 Thread Andrzej Bialecki (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-939?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12973915#action_12973915
 ] 

Andrzej Bialecki  commented on NUTCH-939:
-

1.2 release is out, and branch-1.2 is unlikely to result in a subsequent 
release - most users seem to be interested either in 1.3 or trunk.

 Added -dir command line option to Indexer and SolrIndexer,  allowing to 
 specify directory containing segments
 -

 Key: NUTCH-939
 URL: https://issues.apache.org/jira/browse/NUTCH-939
 Project: Nutch
  Issue Type: Improvement
  Components: indexer
Affects Versions: 1.3
Reporter: Claudio Martella
Assignee: Andrzej Bialecki 
Priority: Minor
 Fix For: 1.3

 Attachments: Indexer.patch, SolrIndexer.patch


 The patches add -dir option, so the user can specify the directory in which 
 the segments are to be found. The actual mode is to specify the list of 
 segments, which is not very easy with hdfs. Also, the -dir option is already 
 implemented in LinkDB and SegmentMerger, for example.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



Re: Does Nutch 2.0 in good enough shape to test?

2010-12-17 Thread Andrzej Bialecki

(switching to devs)

On 12/17/10 10:18 AM, Alexis wrote:

Hi,

I've spent some time working on this as well. I've just put together a
blog entry addressing the issues I ran into. See
http://techvineyard.blogspot.com/2010/12/build-nutch-20.html

In a nutshell, I changed three pieces in Gora and Nutch code:
- flush the datastore regularly in the Hadoop RecordWriter (in GoraOutputFormat)


Careful here. DataStore flush may be very expensive, so it should be 
done only when we are finished with the output. If you see that data is 
lost without this flush then this should be reported as a Gora bug.



- wait for Hadoop job completion in the Fetcher job


I missed your previous email... I'll fix this shortly - thanks for 
spotting it.


--
Best regards,
Andrzej Bialecki 
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: Java.io.IOException with multiple <copyField/> directives

2010-12-03 Thread Andrzej Bialecki
On 2010-12-03 09:52, Peter Litsegård wrote:
 Hi!
 
 I've run into a strange behaviour while using Nutch (solrindexer) together 
 with Solr 1.4.1. I'd like to copy the 'title' and 'content' field to another 
 field, say, 'foo'. In my first attempt I added the <copyField/> directives in 
 schema.xml and got the java exception so I removed them from schema.xml. In 
 my second attempt I added the <copyField/> directives to the 
 'solrindex-mapping.xml' file and ran into the same exception again! Is this a 
 known issue or have I stumbled into unknown territory?
 
 Any workarounds?

I suspect that the field type declared in your schema.xml is not
multiValued. What was the exception?
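
For reference, a copy target that accepts several sources must be declared multiValued in schema.xml. A minimal sketch (field and type names here are illustrative, not taken from the original report):

```xml
<!-- Target field must be multiValued when more than one source copies into it -->
<field name="foo" type="text" indexed="true" stored="true" multiValued="true"/>

<copyField source="title" dest="foo"/>
<copyField source="content" dest="foo"/>
```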


-- 
Best regards,
Andrzej Bialecki 
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



[jira] Commented: (NUTCH-939) Added -dir command line option to Indexer and SolrIndexer, allowing to specify directory containing segments

2010-11-26 Thread Andrzej Bialecki (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-939?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12936047#action_12936047
 ] 

Andrzej Bialecki  commented on NUTCH-939:
-

Please note that trunk uses a very different method of working with segments 
(called batches there), and -dir is not applicable there.

 Added -dir command line option to Indexer and SolrIndexer,  allowing to 
 specify directory containing segments
 -

 Key: NUTCH-939
 URL: https://issues.apache.org/jira/browse/NUTCH-939
 Project: Nutch
  Issue Type: Improvement
  Components: indexer
Affects Versions: 1.2
Reporter: Claudio Martella
Priority: Minor
 Fix For: 1.2

 Attachments: Indexer.patch, SolrIndexer.patch


 The patches add -dir option, so the user can specify the directory in which 
 the segments are to be found. The actual mode is to specify the list of 
 segments, which is not very easy with hdfs. Also, the -dir option is already 
 implemented in LinkDB and SegmentMerger, for example.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (NUTCH-932) Bulk REST API to retrieve crawl results as JSON

2010-11-25 Thread Andrzej Bialecki (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-932?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrzej Bialecki  updated NUTCH-932:


Attachment: NUTCH-932-4.patch

Final version of the patch.

 Bulk REST API to retrieve crawl results as JSON
 ---

 Key: NUTCH-932
 URL: https://issues.apache.org/jira/browse/NUTCH-932
 Project: Nutch
  Issue Type: New Feature
  Components: REST_api
Affects Versions: 2.0
Reporter: Andrzej Bialecki 
Assignee: Andrzej Bialecki 
 Attachments: db.formatted.gz, NUTCH-932-2.patch, NUTCH-932-3.patch, 
 NUTCH-932-4.patch, NUTCH-932.patch, NUTCH-932.patch, NUTCH-932.patch


 It would be useful to be able to retrieve results of a crawl as JSON. There 
 are a few things that need to be discussed:
 * how to return bulk results using Restlet (WritableRepresentation subclass?)
 * what should be the format of results?
 I think it would make sense to provide a single record retrieval (by primary 
 key), all records, and records within a range. This incidentally matches well 
 the capabilities of the Gora Query class :)

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Resolved: (NUTCH-932) Bulk REST API to retrieve crawl results as JSON

2010-11-25 Thread Andrzej Bialecki (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-932?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrzej Bialecki  resolved NUTCH-932.
-

   Resolution: Fixed
Fix Version/s: 2.0

Committed in rev. 1039014.

 Bulk REST API to retrieve crawl results as JSON
 ---

 Key: NUTCH-932
 URL: https://issues.apache.org/jira/browse/NUTCH-932
 Project: Nutch
  Issue Type: New Feature
  Components: REST_api
Affects Versions: 2.0
Reporter: Andrzej Bialecki 
Assignee: Andrzej Bialecki 
 Fix For: 2.0

 Attachments: db.formatted.gz, NUTCH-932-2.patch, NUTCH-932-3.patch, 
 NUTCH-932-4.patch, NUTCH-932.patch, NUTCH-932.patch, NUTCH-932.patch


 It would be useful to be able to retrieve results of a crawl as JSON. There 
 are a few things that need to be discussed:
 * how to return bulk results using Restlet (WritableRepresentation subclass?)
 * what should be the format of results?
 I think it would make sense to provide a single record retrieval (by primary 
 key), all records, and records within a range. This incidentally matches well 
 the capabilities of the Gora Query class :)

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (NUTCH-932) Bulk REST API to retrieve crawl results as JSON

2010-11-12 Thread Andrzej Bialecki (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-932?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrzej Bialecki  updated NUTCH-932:


Attachment: NUTCH-932-3.patch

NutchTool is an abstract class in this patch. This actually minimizes the 
amount of code throughout, though paradoxically the patch file is larger than 
before...

 Bulk REST API to retrieve crawl results as JSON
 ---

 Key: NUTCH-932
 URL: https://issues.apache.org/jira/browse/NUTCH-932
 Project: Nutch
  Issue Type: New Feature
  Components: REST_api
Affects Versions: 2.0
Reporter: Andrzej Bialecki 
Assignee: Andrzej Bialecki 
 Attachments: db.formatted.gz, NUTCH-932-2.patch, NUTCH-932-3.patch, 
 NUTCH-932.patch, NUTCH-932.patch, NUTCH-932.patch


 It would be useful to be able to retrieve results of a crawl as JSON. There 
 are a few things that need to be discussed:
 * how to return bulk results using Restlet (WritableRepresentation subclass?)
 * what should be the format of results?
 I think it would make sense to provide a single record retrieval (by primary 
 key), all records, and records within a range. This incidentally matches well 
 the capabilities of the Gora Query class :)

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (NUTCH-880) REST API for Nutch

2010-11-05 Thread Andrzej Bialecki (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-880?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12928909#action_12928909
 ] 

Andrzej Bialecki  commented on NUTCH-880:
-

Thanks - this issue is already fixed in NUTCH-932, to be committed soon.

 REST API for Nutch
 --

 Key: NUTCH-880
 URL: https://issues.apache.org/jira/browse/NUTCH-880
 Project: Nutch
  Issue Type: New Feature
Affects Versions: 2.0
Reporter: Andrzej Bialecki 
Assignee: Andrzej Bialecki 
 Fix For: 2.0

 Attachments: API-2.patch, API.patch


 This issue is for discussing a REST-style API for accessing Nutch.
 Here's an initial idea:
 * I propose to use org.restlet for handling requests and returning 
 JSON/XML/whatever responses.
 * hook up all regular tools so that they can be driven via this API. This 
 would have to be an async API, since all Nutch operations take long time to 
 execute. It follows then that we need to be able also to list running 
 operations, retrieve their current status, and possibly 
 abort/cancel/stop/suspend/resume/...? This also means that we would have to 
 potentially create & manage many threads in a servlet - AFAIK this is frowned 
 upon by J2EE purists...
 * package this in a webapp (that includes all deps, essentially nutch.job 
 content), with the restlet servlet as an entry point.
 Open issues:
 * how to implement the reading of crawl results via this API
 * should we manage only crawls that use a single configuration per webapp, or 
 should we have a notion of crawl contexts (sets of crawl configs) with CRUD 
 ops on them? this would be nice, because it would allow managing of several 
 different crawls, with different configs, in a single webapp - but it 
 complicates the implementation a lot.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (NUTCH-932) Bulk REST API to retrieve crawl results as JSON

2010-11-04 Thread Andrzej Bialecki (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-932?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrzej Bialecki  updated NUTCH-932:


Attachment: NUTCH-932.patch

This patch adds bulk retrieval of crawl results. This is still very rough, e.g. 
there's no way to select crawlId or limit the fields... but it returns proper 
JSON.

This patch also includes other enhancements and bugfixes - with this patch I 
was able to perform a complete crawl cycle via REST.

 Bulk REST API to retrieve crawl results as JSON
 ---

 Key: NUTCH-932
 URL: https://issues.apache.org/jira/browse/NUTCH-932
 Project: Nutch
  Issue Type: New Feature
  Components: REST_api
Affects Versions: 2.0
Reporter: Andrzej Bialecki 
Assignee: Andrzej Bialecki 
 Attachments: NUTCH-932.patch


 It would be useful to be able to retrieve results of a crawl as JSON. There 
 are a few things that need to be discussed:
 * how to return bulk results using Restlet (WritableRepresentation subclass?)
 * what should be the format of results?
 I think it would make sense to provide a single record retrieval (by primary 
 key), all records, and records within a range. This incidentally matches well 
 the capabilities of the Gora Query class :)

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (NUTCH-932) Bulk REST API to retrieve crawl results as JSON

2010-11-04 Thread Andrzej Bialecki (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-932?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrzej Bialecki  updated NUTCH-932:


Attachment: db.formatted.gz

Example DB content (this was passed through a JSON pretty-printer, otherwise 
it's just one giant line...).

 Bulk REST API to retrieve crawl results as JSON
 ---

 Key: NUTCH-932
 URL: https://issues.apache.org/jira/browse/NUTCH-932
 Project: Nutch
  Issue Type: New Feature
  Components: REST_api
Affects Versions: 2.0
Reporter: Andrzej Bialecki 
Assignee: Andrzej Bialecki 
 Attachments: db.formatted.gz, NUTCH-932.patch


 It would be useful to be able to retrieve results of a crawl as JSON. There 
 are a few things that need to be discussed:
 * how to return bulk results using Restlet (WritableRepresentation subclass?)
 * what should be the format of results?
 I think it would make sense to provide a single record retrieval (by primary 
 key), all records, and records within a range. This incidentally matches well 
 the capabilities of the Gora Query class :)

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (NUTCH-932) Bulk REST API to retrieve crawl results as JSON

2010-11-04 Thread Andrzej Bialecki (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-932?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12928355#action_12928355
 ] 

Andrzej Bialecki  commented on NUTCH-932:
-

Examples (with the db equivalent to the one in db.formatted.gz):

{code}
$ curl -s 
'http://localhost:8192/nutch/db?fields=url&end=http://www.freebsd.org/&start=http://www.egothor.org/' |
 ./json_pp
[
  {
    "url": "http://www.egothor.org/"
  }, 
  {
    "url": "http://www.freebsd.org/"
  }
]
{code}

{code}
$ curl -s 
'http://localhost:8192/nutch/db?fields=url,outlinks,markers,protocolStatus,parseStatus,contentType&start=http://www.getopt.org/&end=http://www.getopt.org/' |
 ./json_pp
[
  {
    "contentType": "text/html", 
    "url": "http://www.getopt.org/", 
    "markers": {
      "_updmrk_": "1288890451-1134865895"
    }, 
    "parseStatus": "success/ok (1/0), args=[]", 
    "protocolStatus": "SUCCESS, args=[]", 
    "outlinks": {
      "http://www.getopt.org/luke/": "Luke", 
      "http://www.getopt.org/ecimf/contrib/ONTO/REA": "REA Ontology page", 
      "http://www.getopt.org/CV.pdf": "CV here", 
      "http://www.getopt.org/utils/build/api": "API", 
      "http://svn.apache.org/viewvc/hadoop/hbase/trunk/src/java/org/apache/hadoop/hbase/util/JenkinsHash.java": "available here", 
      "http://www.getopt.org/murmur/MurmurHash.java": "MurmurHash.java", 
      "http://www.ebxml.org/": "ebXML / ebTWG", 
      "http://www.freebsd.org/": "FreeBSD", 
      "http://www.getopt.org/luke/webstart.html": "Launch with Java WebStart", 
      "http://www.freebsd.org/%7Epicobsd": "PicoBSD", 
      "http://home.comcast.net/~bretm/hash/6.html": "this discussion", 
      "http://protege.stanford.edu/": "Protege", 
      "http://jakarta.apache.org/lucene": "Lucene", 
      "http://www.getopt.org/ecimf/contrib/ONTO/ebxml": "ebXML Ontology", 
      "http://www.getopt.org/ecimf/": "here", 
      "http://www.isthe.com/chongo/tech/comp/fnv/": "his website", 
      "http://www.getopt.org/stempel/index.html": "Stempel", 
      "http://www.sigram.com/": "SIGRAM", 
      "http://www.egothor.org/": "Egothor", 
      "http://thinlet.sourceforge.net/": "Thinlet", 
      "http://www.getopt.org/utils/dist/utils-1.0.jar": "binary", 
      "http://www.ecimf.org/": "ECIMF"
    }
  }
]
{code}


 Bulk REST API to retrieve crawl results as JSON
 ---

 Key: NUTCH-932
 URL: https://issues.apache.org/jira/browse/NUTCH-932
 Project: Nutch
  Issue Type: New Feature
  Components: REST_api
Affects Versions: 2.0
Reporter: Andrzej Bialecki 
Assignee: Andrzej Bialecki 
 Attachments: db.formatted.gz, NUTCH-932.patch, NUTCH-932.patch


 It would be useful to be able to retrieve results of a crawl as JSON. There 
 are a few things that need to be discussed:
 * how to return bulk results using Restlet (WritableRepresentation subclass?)
 * what should be the format of results?
 I think it would make sense to provide a single record retrieval (by primary 
 key), all records, and records within a range. This incidentally matches well 
 the capabilities of the Gora Query class :)

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (NUTCH-932) Bulk REST API to retrieve crawl results as JSON

2010-11-04 Thread Andrzej Bialecki (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-932?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrzej Bialecki  updated NUTCH-932:


Attachment: NUTCH-932.patch

Updated patch - this recognizes now URL parameters such as fields, start/end 
keys, batch and crawl id.

 Bulk REST API to retrieve crawl results as JSON
 ---

 Key: NUTCH-932
 URL: https://issues.apache.org/jira/browse/NUTCH-932
 Project: Nutch
  Issue Type: New Feature
  Components: REST_api
Affects Versions: 2.0
Reporter: Andrzej Bialecki 
Assignee: Andrzej Bialecki 
 Attachments: db.formatted.gz, NUTCH-932.patch, NUTCH-932.patch


 It would be useful to be able to retrieve results of a crawl as JSON. There 
 are a few things that need to be discussed:
 * how to return bulk results using Restlet (WritableRepresentation subclass?)
 * what should be the format of results?
 I think it would make sense to provide a single record retrieval (by primary 
 key), all records, and records within a range. This incidentally matches well 
 the capabilities of the Gora Query class :)

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Resolved: (NUTCH-931) Simple admin API to fetch status and stop the service

2010-10-29 Thread Andrzej Bialecki (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-931?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrzej Bialecki  resolved NUTCH-931.
-

Resolution: Fixed

Committed in rev. 1028736 with some changes.

 Simple admin API to fetch status and stop the service
 -

 Key: NUTCH-931
 URL: https://issues.apache.org/jira/browse/NUTCH-931
 Project: Nutch
  Issue Type: Improvement
  Components: REST_api
Affects Versions: 2.0
Reporter: Andrzej Bialecki 
Assignee: Andrzej Bialecki 
 Fix For: 2.0

 Attachments: NUTCH-931.patch


 REST API needs a simple info / stats service and the ability to shutdown the 
 server.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (NUTCH-880) REST API for Nutch

2010-10-28 Thread Andrzej Bialecki (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-880?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrzej Bialecki  updated NUTCH-880:


Summary: REST API for Nutch  (was: REST API (and webapp) for Nutch)

The webapp part is tracked now in NUTCH-929.

 REST API for Nutch
 --

 Key: NUTCH-880
 URL: https://issues.apache.org/jira/browse/NUTCH-880
 Project: Nutch
  Issue Type: New Feature
Affects Versions: 2.0
Reporter: Andrzej Bialecki 
Assignee: Andrzej Bialecki 
 Attachments: API-2.patch, API.patch


 This issue is for discussing a REST-style API for accessing Nutch.
 Here's an initial idea:
 * I propose to use org.restlet for handling requests and returning 
 JSON/XML/whatever responses.
 * hook up all regular tools so that they can be driven via this API. This 
 would have to be an async API, since all Nutch operations take long time to 
 execute. It follows then that we need to be able also to list running 
 operations, retrieve their current status, and possibly 
 abort/cancel/stop/suspend/resume/...? This also means that we would have to 
 potentially create & manage many threads in a servlet - AFAIK this is frowned 
 upon by J2EE purists...
 * package this in a webapp (that includes all deps, essentially nutch.job 
 content), with the restlet servlet as an entry point.
 Open issues:
 * how to implement the reading of crawl results via this API
 * should we manage only crawls that use a single configuration per webapp, or 
 should we have a notion of crawl contexts (sets of crawl configs) with CRUD 
 ops on them? this would be nice, because it would allow managing of several 
 different crawls, with different configs, in a single webapp - but it 
 complicates the implementation a lot.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Resolved: (NUTCH-880) REST API for Nutch

2010-10-28 Thread Andrzej Bialecki (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-880?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrzej Bialecki  resolved NUTCH-880.
-

   Resolution: Fixed
Fix Version/s: 2.0

Committed in rev. 1028235. The webapp part of this issue is tracked now in 
NUTCH-929.

 REST API for Nutch
 --

 Key: NUTCH-880
 URL: https://issues.apache.org/jira/browse/NUTCH-880
 Project: Nutch
  Issue Type: New Feature
Affects Versions: 2.0
Reporter: Andrzej Bialecki 
Assignee: Andrzej Bialecki 
 Fix For: 2.0

 Attachments: API-2.patch, API.patch


 This issue is for discussing a REST-style API for accessing Nutch.
 Here's an initial idea:
 * I propose to use org.restlet for handling requests and returning 
 JSON/XML/whatever responses.
 * hook up all regular tools so that they can be driven via this API. This 
 would have to be an async API, since all Nutch operations take long time to 
 execute. It follows then that we need to be able also to list running 
 operations, retrieve their current status, and possibly 
 abort/cancel/stop/suspend/resume/...? This also means that we would have to 
 potentially create & manage many threads in a servlet - AFAIK this is frowned 
 upon by J2EE purists...
 * package this in a webapp (that includes all deps, essentially nutch.job 
 content), with the restlet servlet as an entry point.
 Open issues:
 * how to implement the reading of crawl results via this API
 * should we manage only crawls that use a single configuration per webapp, or 
 should we have a notion of crawl contexts (sets of crawl configs) with CRUD 
 ops on them? this would be nice, because it would allow managing of several 
 different crawls, with different configs, in a single webapp - but it 
 complicates the implementation a lot.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Created: (NUTCH-930) Remove remaining dependencies on Lucene API

2010-10-28 Thread Andrzej Bialecki (JIRA)
Remove remaining dependencies on Lucene API
---

 Key: NUTCH-930
 URL: https://issues.apache.org/jira/browse/NUTCH-930
 Project: Nutch
  Issue Type: Improvement
Affects Versions: 2.0
Reporter: Andrzej Bialecki 
Assignee: Andrzej Bialecki 


Nutch doesn't use Lucene API anymore, all indexing happens via Lucene-agnostic 
SolrJ API. The only place where we still use a minor part of Lucene is in 
index-basic, and that use (DateTools) can be easily replaced.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (NUTCH-930) Remove remaining dependencies on Lucene API

2010-10-28 Thread Andrzej Bialecki (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-930?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrzej Bialecki  updated NUTCH-930:


Attachment: NUTCH-930.patch

Patch to fix the issue. I'll commit this shortly.

 Remove remaining dependencies on Lucene API
 ---

 Key: NUTCH-930
 URL: https://issues.apache.org/jira/browse/NUTCH-930
 Project: Nutch
  Issue Type: Improvement
Affects Versions: 2.0
Reporter: Andrzej Bialecki 
Assignee: Andrzej Bialecki 
 Attachments: NUTCH-930.patch


 Nutch doesn't use Lucene API anymore, all indexing happens via 
 Lucene-agnostic SolrJ API. The only place where we still use a minor part of 
 Lucene is in index-basic, and that use (DateTools) can be easily replaced.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Resolved: (NUTCH-930) Remove remaining dependencies on Lucene API

2010-10-28 Thread Andrzej Bialecki (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-930?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrzej Bialecki  resolved NUTCH-930.
-

   Resolution: Fixed
Fix Version/s: 2.0

Committed in rev. 1028474.

 Remove remaining dependencies on Lucene API
 ---

 Key: NUTCH-930
 URL: https://issues.apache.org/jira/browse/NUTCH-930
 Project: Nutch
  Issue Type: Improvement
Affects Versions: 2.0
Reporter: Andrzej Bialecki 
Assignee: Andrzej Bialecki 
 Fix For: 2.0

 Attachments: NUTCH-930.patch


 Nutch doesn't use Lucene API anymore, all indexing happens via 
 Lucene-agnostic SolrJ API. The only place where we still use a minor part of 
 Lucene is in index-basic, and that use (DateTools) can be easily replaced.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Created: (NUTCH-931) Simple admin API to fetch status and stop the service

2010-10-28 Thread Andrzej Bialecki (JIRA)
Simple admin API to fetch status and stop the service
-

 Key: NUTCH-931
 URL: https://issues.apache.org/jira/browse/NUTCH-931
 Project: Nutch
  Issue Type: Improvement
  Components: REST_api
Affects Versions: 2.0
Reporter: Andrzej Bialecki 
Assignee: Andrzej Bialecki 
 Fix For: 2.0


REST API needs a simple info / stats service and the ability to shutdown the 
server.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (NUTCH-926) Nutch follows wrong url in META http-equiv=refresh tag

2010-10-27 Thread Andrzej Bialecki (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-926?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12925543#action_12925543
 ] 

Andrzej Bialecki  commented on NUTCH-926:
-

bq. Nutch continues to crawl the WRONG subdomains! But it should not do this!!
No need to shout, we hear you :)

Indeed, Nutch behavior when following redirects doesn't play well with the rule 
of ignoring external outlinks. Strictly speaking, redirects are not outlinks, 
but the silent assumption behind ignoreExternalOutlinks is that we crawl 
content only from that hostname.

And your patch would solve this particular issue. However, this is not as 
simple as it seems... My favorite example is www.ibm.com -> 
www8.ibm.com/index.html . If we apply your fix you won't be able to crawl 
www.ibm.com unless you inject all wwwNNN load-balanced hosts... so a simple 
equality of hostnames may not be sufficient. We have utilities to extract 
domain names, so we could compare domains but then we may mistreat 
money.cnn.com vs. weather.cnn.com ...
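The host-vs-domain tradeoff described above can be sketched as follows. This is a standalone illustration, not Nutch's actual URL utility API; the class and method names are hypothetical, and "registered domain" is simplified to the last two host labels (a real implementation would need a public-suffix list).

```java
// Sketch of the two redirect-scoping rules discussed above (hypothetical names).
public class RedirectScope {

    // Strict rule: redirect target must have exactly the same host.
    // Rejects www.ibm.com -> www8.ibm.com, which we may actually want to allow.
    public static boolean sameHost(String fromHost, String toHost) {
        return fromHost.equalsIgnoreCase(toHost);
    }

    // Naive "registered domain": last two labels of the host name.
    // (A real implementation needs a public-suffix list, e.g. for .co.uk.)
    static String lastTwoLabels(String host) {
        String[] parts = host.toLowerCase().split("\\.");
        int n = parts.length;
        return n <= 2 ? host.toLowerCase() : parts[n - 2] + "." + parts[n - 1];
    }

    // Looser rule: compare registered domains. Accepts www.ibm.com -> www8.ibm.com,
    // but also mistreats money.cnn.com vs. weather.cnn.com as "the same site".
    public static boolean sameDomain(String fromHost, String toHost) {
        return lastTwoLabels(fromHost).equals(lastTwoLabels(toHost));
    }
}
```

Neither rule is correct on its own, which is exactly the difficulty raised in the comment: the strict rule breaks load-balanced hosts, the loose rule merges distinct subdomain sites.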

 Nutch follows wrong url in META http-equiv=refresh tag
 -

 Key: NUTCH-926
 URL: https://issues.apache.org/jira/browse/NUTCH-926
 Project: Nutch
  Issue Type: Bug
  Components: parser
Affects Versions: 1.2
 Environment: gnu/linux centOs
Reporter: Marco Novo
Priority: Critical
 Fix For: 1.3

 Attachments: ParseOutputFormat.java.patch


 We have nutch set to crawl a domain urllist and we want to fetch only passed 
 domains (hosts) not subdomains.
 So
 WWW.DOMAIN1.COM
 ..
 ..
 ..
 WWW.RIGHTDOMAIN.COM
 ..
 ..
 ..
 ..
 WWW.DOMAIN.COM
 We sets nutch to:
 NOT FOLLOW EXERNAL LINKS
 During crawling of WWW.RIGHTDOMAIN.COM
 if a page contains
 <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
 <html>
 <head>
 <title></title>
 <META http-equiv="refresh" content="0;
 url=http://WRONG.RIGHTDOMAIN.COM">
 </head>
 <body>
 </body>
 </html>
 Nutch continues to crawl the WRONG subdomains! But it should not do this!!
 During crawling of WWW.RIGHTDOMAIN.COM
 if a page contains
 <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
 <html>
 <head>
 <title></title>
 <META http-equiv="refresh" content="0;
 url=http://WWW.WRONGDOMAIN.COM">
 </head>
 <body>
 </body>
 </html>
 Nutch continues to crawl the WRONG domain! But it should not do this! If that 
 we will spider all the web
 We think the problem is in org.apache.nutch.parse ParseOutputFormat. We have 
 done a patch so we will attach it

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



Re: ReviewBoard Instance

2010-10-26 Thread Andrzej Bialecki
On 2010-10-26 15:53, Mattmann, Chris A (388J) wrote:
 Hi Guys,
 
 Gav from infra@ set up a ReviewBoard instance for Apache [1]. I've never
 used it before but I thought I'd request an account on it for Nutch [2]
 regardless, so if folks want to use it, they can.

Hmm, I may be missing something... but what's the point of using the
tool in our JIRA-based workflow? It looks to me like it duplicates at
least part of JIRA's functionality, and the remaining part is what we do
also in JIRA by convention...


-- 
Best regards,
Andrzej Bialecki 
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



[jira] Commented: (NUTCH-913) Nutch should use new namespace for Gora

2010-10-25 Thread Andrzej Bialecki (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-913?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12924659#action_12924659
 ] 

Andrzej Bialecki  commented on NUTCH-913:
-

+1, let's commit it -  I want to start playing with GORA-9, and that patch is 
in the org.apache namespace...

 Nutch should use new namespace for Gora
 ---

 Key: NUTCH-913
 URL: https://issues.apache.org/jira/browse/NUTCH-913
 Project: Nutch
  Issue Type: Bug
  Components: storage
Reporter: Doğacan Güney
Assignee: Doğacan Güney
 Fix For: 2.0

 Attachments: NUTCH-913_v1.patch, NUTCH-913_v2.patch


 Gora is in Apache Incubator now (Yey!). We recently changed Gora's namespace 
 from org.gora to org.apache.gora. This means nutch should use the new 
 namespace otherwise it won't compile with newer builds of Gora.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (NUTCH-923) Multilingual support for Solr-index-mapping

2010-10-23 Thread Andrzej Bialecki (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-923?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12924154#action_12924154
 ] 

Andrzej Bialecki  commented on NUTCH-923:
-

This doesn't solve the problem of potentially unbounded number of fields. 
Compliance is one thing, and you can clean up field names from invalid 
characters, but sanity is another thing - if you have {{title_*}} in your Solr 
schema then theoretically you are allowed to create an unlimited number of fields 
with this prefix - Solr won't complain.

 Multilingual support for Solr-index-mapping
 ---

 Key: NUTCH-923
 URL: https://issues.apache.org/jira/browse/NUTCH-923
 Project: Nutch
  Issue Type: Improvement
  Components: indexer
Affects Versions: 1.2
Reporter: Matthias Agethle
Assignee: Markus Jelsma
Priority: Minor

 It would be useful to extend the mapping possibilities when indexing to Solr.
 One useful feature would be to use the detected language of the HTML page 
 (for example via the language-identifier plugin) and send the content to 
 corresponding language-aware Solr fields.
 The mapping file could be as follows:
 <field dest="lang" source="lang"/>
 <field dest="title_${lang}" source="title"/>
 so that the title field gets mapped to title_en for English pages and 
 title_fr for French pages.
 What do you think? Could this be useful also to others?
 Or are there already other solutions out there?

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (NUTCH-924) Static field in solr mapping

2010-10-22 Thread Andrzej Bialecki (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-924?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12923845#action_12923845
 ] 

Andrzej Bialecki  commented on NUTCH-924:
-

The functionality is useful, +1. But the patch has formatting errors. Please 
fix them before committing.

The same functionality should be added to trunk, too.

 Static field in solr mapping
 

 Key: NUTCH-924
 URL: https://issues.apache.org/jira/browse/NUTCH-924
 Project: Nutch
  Issue Type: Improvement
  Components: indexer
Affects Versions: 1.3
Reporter: David Stuart
Assignee: Markus Jelsma
 Fix For: 1.3

 Attachments: nutch_1.3_static_field.patch

   Original Estimate: 0h
  Remaining Estimate: 0h

 Provide the facility to pass static data defined in solrindex-mapping.xml to 
 solr during the mapping process.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (NUTCH-923) Multilingual support for Solr-index-mapping

2010-10-22 Thread Andrzej Bialecki (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-923?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12923896#action_12923896
 ] 

Andrzej Bialecki  commented on NUTCH-923:
-

This sounds useful, though the implementation needs to keep the following in 
mind:
* you _assume_ that the lang field will have a nice predictable value, but 
unless you sanitize the values you can't assume anything... example: one page I 
saw had a language metadata set to a random string 8kB long with various 
control chars and '\0'-s.

* again, if you don't sanitize and control the total number of unique values in 
the source field, you could end up with a number of fields approaching 
infinity, and Solr would melt down...
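The sanitization step argued for above can be sketched like this. This is an illustrative stand-alone helper, not part of Nutch or the language-identifier plugin; the class name and the whitelist contents are assumptions (a real deployment would enumerate exactly the languages its Solr schema defines fields for).

```java
import java.util.Set;

// Sketch: reduce an arbitrary (possibly garbage) language value to a bounded
// set, so the number of title_<lang> fields in Solr stays finite.
public class LangSanitizer {

    // Hypothetical whitelist of languages the Solr schema actually supports.
    private static final Set<String> KNOWN = Set.of("en", "fr", "de", "it", "es");

    public static String sanitize(String raw) {
        if (raw == null) return "unknown";
        String code = raw.trim().toLowerCase();
        // Keep only the primary subtag, e.g. "en-US" -> "en".
        if (code.length() > 2) code = code.substring(0, 2);
        // Anything outside the whitelist (8 kB of control chars included)
        // collapses to a single bucket instead of spawning a new field.
        return KNOWN.contains(code) ? code : "unknown";
    }
}
```

With this in front of the mapping, `title_${lang}` can expand to at most six distinct field names, no matter what the pages claim.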

 Multilingual support for Solr-index-mapping
 ---

 Key: NUTCH-923
 URL: https://issues.apache.org/jira/browse/NUTCH-923
 Project: Nutch
  Issue Type: Improvement
  Components: indexer
Affects Versions: 1.2
Reporter: Matthias Agethle
Assignee: Markus Jelsma
Priority: Minor

 It would be useful to extend the mapping possibilities when indexing to Solr.
 One useful feature would be to use the detected language of the HTML page 
 (for example via the language-identifier plugin) and send the content to 
 corresponding language-aware Solr fields.
 The mapping file could be as follows:
 <field dest="lang" source="lang"/>
 <field dest="title_${lang}" source="title"/>
 so that the title field gets mapped to title_en for English pages and 
 title_fr for French pages.
 What do you think? Could this be useful also to others?
 Or are there already other solutions out there?

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



Re: Build failed in Hudson: Nutch-trunk #1280

2010-10-19 Thread Andrzej Bialecki
On 2010-10-19 06:01, Apache Hudson Server wrote:

 [Nutch-trunk] $ /bin/bash -xe /tmp/hudson7277994413075810777.sh
 + 
 PATH=/home/hudson/tools/java/latest1.6/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/ucb:/usr/local/bin:/usr/bin:/usr/sfw/bin:/usr/sfw/sbin:/opt/sfw/bin:/opt/sfw/sbin:/opt/SUNWspro/bin:/usr/X/bin:/usr/ucb:/usr/sbin:/usr/ccs/bin
 + export ANT_HOME=/export/home/hudson/tools/ant/latest
 + ANT_HOME=/export/home/hudson/tools/ant/latest
 + export PATH ANT_HOME
 + cd trunk
 + /export/home/hudson/tools/ant/latest/bin/ant -Dversion=2010-10-19_04-00-41 
 -Dtest.junit.output.format=xml nightly
 /tmp/hudson7277994413075810777.sh: line 7: 
 /export/home/hudson/tools/ant/latest/bin/ant: No such file or directory

Do you know guys why the automated builds are failing? Looks like Ant is
not where the build script expects it to be...


-- 
Best regards,
Andrzej Bialecki 
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



[jira] Updated: (NUTCH-921) Reduce dependency of Nutch on config files

2010-10-19 Thread Andrzej Bialecki (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-921?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrzej Bialecki  updated NUTCH-921:


Attachment: NUTCH-921.patch

Patch that implements reading config parameters from Configuration, and falls 
back to config files if Configuration properties are unspecified.
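The lookup order the patch describes (Configuration property first, config file only as a fallback) can be sketched as below. This is not the patch's actual code; `java.util.Properties` stands in for Hadoop's `Configuration`, and the class, key, and loader names are illustrative.

```java
import java.util.Properties;
import java.util.function.Supplier;

// Sketch of the NUTCH-921 lookup order: prefer an inline Configuration
// property, and only read the traditional config file when it is unset.
public class ConfigFirst {

    public static String getRules(Properties conf, String key,
                                  Supplier<String> fileLoader) {
        String inline = conf.getProperty(key);
        // Property set via API (e.g. an embedding application or a per-job
        // override) wins; otherwise fall back to the classpath file.
        return inline != null ? inline : fileLoader.get();
    }
}
```

This is what makes it possible to run many jobs with slightly different configurations without packing a different file set into each job jar.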

 Reduce dependency of Nutch on config files
 --

 Key: NUTCH-921
 URL: https://issues.apache.org/jira/browse/NUTCH-921
 Project: Nutch
  Issue Type: Improvement
Affects Versions: 2.0
Reporter: Andrzej Bialecki 
Assignee: Andrzej Bialecki 
 Fix For: 2.0

 Attachments: NUTCH-921.patch


 Currently many components in Nutch rely on reading their configuration from 
 files. These files need to be on the classpath (or packed into a job jar). 
 This is inconvenient if you want to manage configuration via API, e.g. when 
 embedding Nutch, or running many jobs with slightly different configurations.
 This issue tracks the improvement to make various components read their 
 config directly from Configuration properties.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (NUTCH-913) Nutch should use new namespace for Gora

2010-10-13 Thread Andrzej Bialecki (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-913?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12920610#action_12920610
 ] 

Andrzej Bialecki  commented on NUTCH-913:
-

There are formatting issues in DomainStatistics.java - the file uses literal 
tabs, which we frown upon, but the patch introduces double-space indent in the 
changed lines. As ugly as it sounds I think this should be changed into tabs, 
and then reformatted in another commit.

Other than that, +1, go for it.

 Nutch should use new namespace for Gora
 ---

 Key: NUTCH-913
 URL: https://issues.apache.org/jira/browse/NUTCH-913
 Project: Nutch
  Issue Type: Bug
  Components: storage
Reporter: Doğacan Güney
Assignee: Doğacan Güney
 Fix For: 2.0

 Attachments: NUTCH-913_v1.patch


 Gora is in Apache Incubator now (Yey!). We recently changed Gora's namespace 
 from org.gora to org.apache.gora. This means nutch should use the new 
 namespace otherwise it won't compile with newer builds of Gora.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (NUTCH-907) DataStore API doesn't support multiple storage areas for multiple disjoint crawls

2010-10-01 Thread Andrzej Bialecki (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-907?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12916870#action_12916870
 ] 

Andrzej Bialecki  commented on NUTCH-907:
-

Hi Sertan,

Thanks for the patch, this looks very good! A few comments:

* I'm not good at naming things either... schemaId is a little bit cryptic 
though. If we didn't already use crawlId I would vote for that (and then rename 
crawlId to batchId or fetchId), as it is now... I don't know, maybe datasetId ..

* since we now create multiple datasets, we need somehow to manage them - i.e. 
list and delete at least (create is implicit). There is no such functionality 
in this patch, but this can be addressed also as a separate issue.

* IndexerMapReduce.createIndexJob: I think it would be useful to pass the 
datasetId as a Job property - this way indexing filter plugins can use this 
property to populate NutchDocument fields if needed. FWIW, this may be a good 
idea to do in other jobs as well...

 DataStore API doesn't support multiple storage areas for multiple disjoint 
 crawls
 -

 Key: NUTCH-907
 URL: https://issues.apache.org/jira/browse/NUTCH-907
 Project: Nutch
  Issue Type: Bug
Reporter: Andrzej Bialecki 
 Fix For: 2.0

 Attachments: NUTCH-907.patch


 In Nutch 1.x it was possible to easily select a set of crawl data (crawldb, 
 page data, linkdb, etc) by specifying a path where the data was stored. This 
 enabled users to run several disjoint crawls with different configs, but 
 still using the same storage medium, just under different paths.
 This is not possible now because there is a 1:1 mapping between a specific 
 DataStore instance and a set of crawl data.
 In order to support this functionality the Gora API should be extended so 
 that it can create stores (and data tables in the underlying storage) that 
 use arbitrary prefixes to identify the particular crawl dataset. Then the 
 Nutch API should be extended to allow passing this crawlId value to select 
 one of possibly many existing crawl datasets.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (NUTCH-882) Design a Host table in GORA

2010-10-01 Thread Andrzej Bialecki (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-882?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12916874#action_12916874
 ] 

Andrzej Bialecki  commented on NUTCH-882:
-

Doğacan, I missed your previous comment... the issue with partial bloom filters 
is usually solved by having each task store its own filter - this worked well for 
MapFile-s because they consisted of multiple parts, so then a Reader would open 
a part and a corresponding bloom filter.

Here it's more complicated, I agree... though this reminds me of the situation 
that is handled by DynamicBloomFilter: it's basically a set of Bloom filters 
with a facade that hides this fact from the user. Here we could construct 
something similar, i.e. don't merge partial filters after closing the output, 
but instead when opening a Reader read all partial filters and pretend they are 
one.
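The "facade over partial filters" idea above can be sketched as a toy implementation. This is not Hadoop's `DynamicBloomFilter` and the hashing is deliberately simplistic; the class names are hypothetical. The point is only the facade behavior: a key added to any part is reported as present by the whole.

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.List;

// Toy facade over several partial Bloom filters: a membership query consults
// every part and reports a hit if any part matches, so callers see one filter.
public class BloomFacade {

    static class Part {
        private final BitSet bits = new BitSet(1024);

        void add(String key) {
            for (int seed = 1; seed <= 3; seed++)
                bits.set(Math.floorMod(key.hashCode() * seed, 1024));
        }

        boolean mightContain(String key) {
            for (int seed = 1; seed <= 3; seed++)
                if (!bits.get(Math.floorMod(key.hashCode() * seed, 1024)))
                    return false;
            return true;  // possibly a false positive, as with any Bloom filter
        }
    }

    private final List<Part> parts = new ArrayList<>();

    public void addPart(Part p) { parts.add(p); }

    // No false negatives: whichever task's partial filter saw the key answers.
    public boolean mightContain(String key) {
        for (Part p : parts)
            if (p.mightContain(key)) return true;
        return false;
    }
}
```

A Reader opening N per-task filters could wrap them this way instead of merging them on close.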

 Design a Host table in GORA
 ---

 Key: NUTCH-882
 URL: https://issues.apache.org/jira/browse/NUTCH-882
 Project: Nutch
  Issue Type: New Feature
Affects Versions: 2.0
Reporter: Julien Nioche
Assignee: Julien Nioche
 Fix For: 2.0

 Attachments: hostdb.patch, NUTCH-882-v1.patch


 Having a separate GORA table for storing information about hosts (and 
 domains?) would be very useful for : 
 * customising the behaviour of the fetching on a host basis e.g. number of 
 threads, min time between threads etc...
 * storing stats
 * keeping metadata and possibly propagate them to the webpages 
 * keeping a copy of the robots.txt and possibly use that later to filter the 
 webtable
 * store sitemaps files and update the webtable accordingly
 I'll try to come up with a GORA schema for such a host table but any comments 
 are of course already welcome 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (NUTCH-864) Fetcher generates entries with status 0

2010-10-01 Thread Andrzej Bialecki (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-864?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12916912#action_12916912
 ] 

Andrzej Bialecki  commented on NUTCH-864:
-

I think the difficulty comes from the simplification in 2.x as compared to 1.x, 
in that we keep a single status per page. In 1.x a side-effect of having two 
locations with two statuses (one db status in crawldb and one fetch status 
in segments) was that we had more information in updatedb to act upon.

Now we should probably keep up to two statuses - one that reflects a temporary 
fetch status, as determined by fetcher, and a final (reconciled) status as 
determined by updatedb, based on the knowledge of not only plain fetch status 
and old status but also possible redirects. If I'm not mistaken, currently the 
status is immediately overwritten by the fetcher, even before we come to updatedb, 
hence the problem.
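The two-status scheme proposed above can be sketched as a reconciliation step. The enum values and mapping below are illustrative only (loosely modeled on the counters in the issue report), not Nutch's actual status model: the fetcher records a provisional fetch status, and updatedb derives the final db status from it instead of letting the fetcher overwrite the single stored status.

```java
// Sketch: keep a provisional fetch status separate from the final db status,
// and let updatedb reconcile them (hypothetical enums, not Nutch's real ones).
public class StatusReconcile {

    enum DbStatus { UNFETCHED, FETCHED, GONE, REDIR }
    enum FetchStatus { SUCCESS, TEMP_MOVED, MOVED, NOTFOUND, EXCEPTION }

    // updatedb's view: map the fetcher's provisional result (plus, in the real
    // system, redirect knowledge and the old db status) to a final db status,
    // so no page is left with a null/0 status.
    public static DbStatus reconcile(FetchStatus f) {
        switch (f) {
            case SUCCESS:    return DbStatus.FETCHED;
            case TEMP_MOVED:
            case MOVED:      return DbStatus.REDIR;
            case NOTFOUND:   return DbStatus.GONE;
            default:         return DbStatus.UNFETCHED;  // e.g. EXCEPTION -> retry later
        }
    }
}
```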

 Fetcher generates entries with status 0
 ---

 Key: NUTCH-864
 URL: https://issues.apache.org/jira/browse/NUTCH-864
 Project: Nutch
  Issue Type: Bug
  Components: fetcher
 Environment: Gora with SQLBackend
 URL: https://svn.apache.org/repos/asf/nutch/branches/nutchbase
 Last Changed Rev: 980748
 Last Changed Date: 2010-07-30 14:19:52 +0200 (Fri, 30 Jul 2010)
Reporter: Julien Nioche
Assignee: Doğacan Güney
 Fix For: 2.0


 After a round of fetching which got the following protocol status :
 10/07/30 15:11:39 INFO mapred.JobClient: ACCESS_DENIED=2
 10/07/30 15:11:39 INFO mapred.JobClient: SUCCESS=1177
 10/07/30 15:11:39 INFO mapred.JobClient: GONE=3
 10/07/30 15:11:39 INFO mapred.JobClient: TEMP_MOVED=138
 10/07/30 15:11:39 INFO mapred.JobClient: EXCEPTION=93
 10/07/30 15:11:39 INFO mapred.JobClient: MOVED=521
 10/07/30 15:11:39 INFO mapred.JobClient: NOTFOUND=62
 I ran : ./nutch org.apache.nutch.crawl.WebTableReader -stats
 10/07/30 15:12:37 INFO crawl.WebTableReader: Statistics for WebTable: 
 10/07/30 15:12:37 INFO crawl.WebTableReader: TOTAL urls:  2690
 10/07/30 15:12:37 INFO crawl.WebTableReader: retry 0: 2690
 10/07/30 15:12:37 INFO crawl.WebTableReader: min score:   0.0
 10/07/30 15:12:37 INFO crawl.WebTableReader: avg score:   0.7587361
 10/07/30 15:12:37 INFO crawl.WebTableReader: max score:   1.0
 10/07/30 15:12:37 INFO crawl.WebTableReader: status 0 (null): 649
 10/07/30 15:12:37 INFO crawl.WebTableReader: status 2 (status_fetched):   
 1177 (SUCCESS=1177)
 10/07/30 15:12:37 INFO crawl.WebTableReader: status 3 (status_gone):  112 
 10/07/30 15:12:37 INFO crawl.WebTableReader: status 34 (status_retry):
 93 (EXCEPTION=93)
 10/07/30 15:12:37 INFO crawl.WebTableReader: status 4 (status_redir_temp):
 138  (TEMP_MOVED=138)
 10/07/30 15:12:37 INFO crawl.WebTableReader: status 5 (status_redir_perm):
 521 (MOVED=521)
 10/07/30 15:12:37 INFO crawl.WebTableReader: WebTable statistics: done
 There should not be any entries with status 0 (null)
 I will investigate a bit more...

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



Re: [VOTE] Apache Nutch 1.2 Release Candidate #4

2010-09-24 Thread Andrzej Bialecki

On 2010-09-24 04:38, Mattmann, Chris A (388J) wrote:

Hi Nutch PMC:

/nudge

Anyone get a chance to review this yet? I have some free cycles tomorrow
and would really think it’s cool if I could finally push out the 1.2 RC.


I had little time this week, but I'm testing it now... I should be done 
tomorrow.



--
Best regards,
Andrzej Bialecki 
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: [VOTE] Apache Nutch 1.2 Release Candidate #4

2010-09-24 Thread Andrzej Bialecki

On 2010-09-24 20:40, Mattmann, Chris A (388J) wrote:

Thanks Andrzej, appreciate it. I know you’ve been really vigilant with
the other RCs I’ve thrown up about testing and I appreciate it. Other
Nutch PMC’ers: just need one more VOTE. Help, please? :)


+1, all unit tests pass, and a test crawl + indexing to Solr went just fine.


--
Best regards,
Andrzej Bialecki 
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



[jira] Commented: (NUTCH-880) REST API (and webapp) for Nutch

2010-09-21 Thread Andrzej Bialecki (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-880?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12913118#action_12913118
 ] 

Andrzej Bialecki  commented on NUTCH-880:
-

bq. I think we can combine the approach you outlined in NUTCH-907 with this one.

I'm not sure... they are really not the same things - you can execute many 
crawls with different seed lists, but still using the same Configuration.

bq. What is CLASS ?

It's the same as bin/nutch fully.qualified.class.name, only here I require that 
it implements NutchTool.

bq. Btw, Andrzej, I will be happy to help out with the implementation if you 
want.

By all means - I didn't have time so far to progress beyond this patch...

 REST API (and webapp) for Nutch
 ---

 Key: NUTCH-880
 URL: https://issues.apache.org/jira/browse/NUTCH-880
 Project: Nutch
  Issue Type: New Feature
Affects Versions: 2.0
Reporter: Andrzej Bialecki 
Assignee: Andrzej Bialecki 
 Attachments: API.patch


 This issue is for discussing a REST-style API for accessing Nutch.
 Here's an initial idea:
 * I propose to use org.restlet for handling requests and returning 
 JSON/XML/whatever responses.
 * hook up all regular tools so that they can be driven via this API. This 
 would have to be an async API, since all Nutch operations take long time to 
 execute. It follows then that we need to be able also to list running 
 operations, retrieve their current status, and possibly 
 abort/cancel/stop/suspend/resume/...? This also means that we would have to 
 potentially create & manage many threads in a servlet - AFAIK this is frowned 
 upon by J2EE purists...
 * package this in a webapp (that includes all deps, essentially nutch.job 
 content), with the restlet servlet as an entry point.
 Open issues:
 * how to implement the reading of crawl results via this API
 * should we manage only crawls that use a single configuration per webapp, or 
 should we have a notion of crawl contexts (sets of crawl configs) with CRUD 
 ops on them? this would be nice, because it would allow managing of several 
 different crawls, with different configs, in a single webapp - but it 
 complicates the implementation a lot.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (NUTCH-909) Add alternative search-provider to Nutch site

2010-09-20 Thread Andrzej Bialecki (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-909?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12912474#action_12912474
 ] 

Andrzej Bialecki  commented on NUTCH-909:
-

bq. It might be better to see the message "Search with Apache Solr" (as on the 
Tika site).

Yes, let's make this uniform.

 Add alternative search-provider to Nutch site
 -

 Key: NUTCH-909
 URL: https://issues.apache.org/jira/browse/NUTCH-909
 Project: Nutch
  Issue Type: Improvement
  Components: documentation
Reporter: Alex Baranau
Priority: Minor
 Attachments: NUTCH-909.patch


 Add an additional search provider (besides the existing Lucid Find): search-lucene.com. 
 Initiated in discussion: http://search-lucene.com/m/2suCr1UnDfF1
 According to Andrzej's suggestion, when preparing the patch let's follow the 
 same rationales as those in TIKA-488, since they are applicable here too, so 
 please refer to that issue for more insight on implementation details.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Assigned: (NUTCH-862) HttpClient null pointer exception

2010-09-17 Thread Andrzej Bialecki (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-862?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrzej Bialecki  reassigned NUTCH-862:
---

Assignee: Andrzej Bialecki 

 HttpClient null pointer exception
 -

 Key: NUTCH-862
 URL: https://issues.apache.org/jira/browse/NUTCH-862
 Project: Nutch
  Issue Type: Bug
  Components: fetcher
Affects Versions: 1.0.0
 Environment: linux, java 6
Reporter: Sebastian Nagel
Assignee: Andrzej Bialecki 
Priority: Minor
 Attachments: NUTCH-862.patch


 When re-fetching a document (a continued crawl) HttpClient throws an null 
 pointer exception causing the document to be emptied:
 2010-07-27 12:45:09,199 INFO  fetcher.Fetcher - fetching 
 http://localhost/doc/selfhtml/html/index.htm
 2010-07-27 12:45:09,203 ERROR httpclient.Http - java.lang.NullPointerException
 2010-07-27 12:45:09,204 ERROR httpclient.Http - at 
 org.apache.nutch.protocol.httpclient.HttpResponse.<init>(HttpResponse.java:138)
 2010-07-27 12:45:09,204 ERROR httpclient.Http - at 
 org.apache.nutch.protocol.httpclient.Http.getResponse(Http.java:154)
 2010-07-27 12:45:09,204 ERROR httpclient.Http - at 
 org.apache.nutch.protocol.http.api.HttpBase.getProtocolOutput(HttpBase.java:220)
 2010-07-27 12:45:09,204 ERROR httpclient.Http - at 
 org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:537)
 2010-07-27 12:45:09,204 INFO  fetcher.Fetcher - fetch of 
 http://localhost/doc/selfhtml/html/index.htm failed with: 
 java.lang.NullPointerException
 Because the document is re-fetched the server answers 304 (not modified):
 127.0.0.1 - - [27/Jul/2010:12:45:09 +0200] "GET /doc/selfhtml/html/index.htm 
 HTTP/1.0" 304 174 "-" "Nutch-1.0"
 No content is sent in this case (empty http body).
 Index: 
 trunk/src/plugin/protocol-httpclient/src/java/org/apache/nutch/protocol/httpclient/HttpResponse.java
 ===
 --- 
 trunk/src/plugin/protocol-httpclient/src/java/org/apache/nutch/protocol/httpclient/HttpResponse.java
 (revision 979647)
 +++ 
 trunk/src/plugin/protocol-httpclient/src/java/org/apache/nutch/protocol/httpclient/HttpResponse.java
 (working copy)
 @@ -134,7 +134,8 @@
  if (code == 200) throw new IOException(e.toString());
  // for codes other than 200 OK, we are fine with empty content
} finally {
 -in.close();
 +if (in != null)
 +  in.close();
  get.abort();
}

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Resolved: (NUTCH-906) Nutch OpenSearch sometimes raises DOMExceptions due to Lucene column names not being valid XML tag names

2010-09-17 Thread Andrzej Bialecki (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-906?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrzej Bialecki  resolved NUTCH-906.
-

Fix Version/s: 1.2
   Resolution: Fixed

Fixed in rev. 998261. Thanks!

 Nutch OpenSearch sometimes raises DOMExceptions due to Lucene column names 
 not being valid XML tag names
 

 Key: NUTCH-906
 URL: https://issues.apache.org/jira/browse/NUTCH-906
 Project: Nutch
  Issue Type: Bug
  Components: web gui
Affects Versions: 1.1
 Environment: Debian GNU/Linux 64-bit
Reporter: Asheesh Laroia
Assignee: Andrzej Bialecki 
 Fix For: 1.2

 Attachments: 
 0001-OpenSearch-If-a-Lucene-column-name-begins-with-a-num.patch

   Original Estimate: 0.33h
  Remaining Estimate: 0.33h

 The Nutch FAQ explains that OpenSearch includes all fields that are 
 available at search result time. However, some Lucene column names can start 
 with numbers. Valid XML tags cannot. If Nutch is generating OpenSearch 
 results for a document with a Lucene document column whose name starts with 
 numbers, the underlying Xerces library throws this exception: 
 org.w3c.dom.DOMException: INVALID_CHARACTER_ERR: An invalid or illegal XML 
 character is specified. 
 So I have written a patch that tests strings before they are used to generate 
 tags within OpenSearch.
 I hope you merge this, or a better version of the patch!

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (NUTCH-907) DataStore API doesn't support multiple storage areas for multiple disjoint crawls

2010-09-16 Thread Andrzej Bialecki (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-907?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12910109#action_12910109
 ] 

Andrzej Bialecki  commented on NUTCH-907:
-

That's very good news - in that case I'm fine with the Gora API as it is now, 
we should change Nutch to make use of this functionality.

 DataStore API doesn't support multiple storage areas for multiple disjoint 
 crawls
 -

 Key: NUTCH-907
 URL: https://issues.apache.org/jira/browse/NUTCH-907
 Project: Nutch
  Issue Type: Bug
Reporter: Andrzej Bialecki 
 Fix For: 2.0


 In Nutch 1.x it was possible to easily select a set of crawl data (crawldb, 
 page data, linkdb, etc) by specifying a path where the data was stored. This 
 enabled users to run several disjoint crawls with different configs, but 
 still using the same storage medium, just under different paths.
 This is not possible now because there is a 1:1 mapping between a specific 
 DataStore instance and a set of crawl data.
 In order to support this functionality the Gora API should be extended so 
 that it can create stores (and data tables in the underlying storage) that 
 use arbitrary prefixes to identify the particular crawl dataset. Then the 
 Nutch API should be extended to allow passing this crawlId value to select 
 one of possibly many existing crawl datasets.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (NUTCH-880) REST API (and webapp) for Nutch

2010-09-16 Thread Andrzej Bialecki (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-880?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrzej Bialecki  updated NUTCH-880:


Attachment: API.patch

Initial patch for discussion. This is a work in progress, so only some 
functionality is implemented, and even less than that is actually working ;)

I would appreciate a review and comments.

 REST API (and webapp) for Nutch
 ---

 Key: NUTCH-880
 URL: https://issues.apache.org/jira/browse/NUTCH-880
 Project: Nutch
  Issue Type: New Feature
Affects Versions: 2.0
Reporter: Andrzej Bialecki 
Assignee: Andrzej Bialecki 
 Attachments: API.patch


 This issue is for discussing a REST-style API for accessing Nutch.
 Here's an initial idea:
 * I propose to use org.restlet for handling requests and returning 
 JSON/XML/whatever responses.
 * hook up all regular tools so that they can be driven via this API. This 
 would have to be an async API, since all Nutch operations take a long time to 
 execute. It follows that we also need to be able to list running 
 operations, retrieve their current status, and possibly 
 abort/cancel/stop/suspend/resume/...? This also means that we would have to 
 potentially create & manage many threads in a servlet - AFAIK this is frowned 
 upon by J2EE purists...
 * package this in a webapp (that includes all deps, essentially nutch.job 
 content), with the restlet servlet as an entry point.
 Open issues:
 * how to implement the reading of crawl results via this API
 * should we manage only crawls that use a single configuration per webapp, or 
 should we have a notion of crawl contexts (sets of crawl configs) with CRUD 
 ops on them? this would be nice, because it would allow managing of several 
 different crawls, with different configs, in a single webapp - but it 
 complicates the implementation a lot.
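To make the async idea concrete, here is a minimal sketch in plain Java (deliberately avoiding any Restlet dependency; JobManager and all of its method names are hypothetical, not part of the patch) of the kind of job registry such a servlet could delegate to: submit() returns a job id immediately, and status/cancel operate on that id.

```java
import java.util.Map;
import java.util.UUID;
import java.util.concurrent.*;

// Hypothetical sketch: tracks long-running Nutch operations so a REST
// layer can list them, poll their status, and cancel them asynchronously.
public class JobManager {
    public enum State { RUNNING, DONE, FAILED, CANCELLED }

    private final ExecutorService pool = Executors.newCachedThreadPool();
    private final Map<String, Future<?>> jobs = new ConcurrentHashMap<>();
    private final Map<String, State> states = new ConcurrentHashMap<>();

    // Submit a long-running task; returns immediately with a job id.
    public String submit(Runnable task) {
        String id = UUID.randomUUID().toString();
        states.put(id, State.RUNNING);
        jobs.put(id, pool.submit(() -> {
            try {
                task.run();
                // only flips RUNNING -> DONE; a cancelled job stays CANCELLED
                states.replace(id, State.RUNNING, State.DONE);
            } catch (Exception e) {
                states.replace(id, State.RUNNING, State.FAILED);
            }
        }));
        return id;
    }

    public State status(String id) { return states.get(id); }

    // Best-effort cancel; interrupts the worker thread.
    public boolean cancel(String id) {
        Future<?> f = jobs.get(id);
        if (f != null && f.cancel(true)) {
            states.put(id, State.CANCELLED);
            return true;
        }
        return false;
    }

    public void shutdown() { pool.shutdownNow(); }
}
```

This sidesteps the "threads in a servlet" objection only partially: the pool still lives in the webapp, which is exactly the J2EE concern raised above.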

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Created: (NUTCH-907) DataStore API doesn't support multiple storage areas for multiple disjoint crawls

2010-09-15 Thread Andrzej Bialecki (JIRA)
DataStore API doesn't support multiple storage areas for multiple disjoint 
crawls
-

 Key: NUTCH-907
 URL: https://issues.apache.org/jira/browse/NUTCH-907
 Project: Nutch
  Issue Type: Bug
Reporter: Andrzej Bialecki 
 Fix For: 2.0


In Nutch 1.x it was possible to easily select a set of crawl data (crawldb, 
page data, linkdb, etc) by specifying a path where the data was stored. This 
enabled users to run several disjoint crawls with different configs, but still 
using the same storage medium, just under different paths.

This is not possible now because there is a 1:1 mapping between a specific 
DataStore instance and a set of crawl data.

In order to support this functionality the Gora API should be extended so that 
it can create stores (and data tables in the underlying storage) that use 
arbitrary prefixes to identify the particular crawl dataset. Then the Nutch API 
should be extended to allow passing this crawlId value to select one of 
possibly many existing crawl datasets.
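A minimal sketch of what the proposed extension might look like, assuming a hypothetical CrawlStoreFactory class (not part of the Gora API); the point is only that a crawlId prefix maps one logical schema onto many physical tables, analogous to the per-path crawl dirs of Nutch 1.x:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch of the proposed extension: a factory that derives
// per-dataset table names so disjoint crawls share the same backend but
// live in separately-named tables.
public class CrawlStoreFactory {
    private final Map<String, String> tables = new ConcurrentHashMap<>();

    // Derive the physical table name from the logical schema name and an
    // optional crawlId prefix (null/empty selects the default dataset).
    public String tableName(String crawlId, String schema) {
        String name = (crawlId == null || crawlId.isEmpty())
            ? schema : crawlId + "_" + schema;
        tables.putIfAbsent(name, name);  // stands in for "create table if missing"
        return name;
    }
}
```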

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (NUTCH-882) Design a Host table in GORA

2010-09-15 Thread Andrzej Bialecki (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-882?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12909757#action_12909757
 ] 

Andrzej Bialecki  commented on NUTCH-882:
-

+1 to NutchContext. See also NUTCH-907 because the changes required in Gora API 
will likely make this task easier (once implemented ;) ).

 Design a Host table in GORA
 ---

 Key: NUTCH-882
 URL: https://issues.apache.org/jira/browse/NUTCH-882
 Project: Nutch
  Issue Type: New Feature
Affects Versions: 2.0
Reporter: Julien Nioche
Assignee: Julien Nioche
 Fix For: 2.0

 Attachments: NUTCH-882-v1.patch


 Having a separate GORA table for storing information about hosts (and 
 domains?) would be very useful for: 
 * customising the behaviour of the fetching on a host basis e.g. number of 
 threads, min time between threads etc...
 * storing stats
 * keeping metadata and possibly propagate them to the webpages 
 * keeping a copy of the robots.txt and possibly use that later to filter the 
 webtable
 * store sitemaps files and update the webtable accordingly
 I'll try to come up with a GORA schema for such a host table, but any comments 
 are of course already welcome.
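To make the schema discussion concrete, here is a hedged first cut at such a host table as a Gora-style Avro record; every field name below is an illustrative assumption, not a committed design:

```json
{
  "name": "Host",
  "type": "record",
  "namespace": "org.apache.nutch.storage",
  "fields": [
    {"name": "metadata",   "type": {"type": "map", "values": "bytes"}},
    {"name": "fetchDelay", "type": "long"},
    {"name": "maxThreads", "type": "int"},
    {"name": "robotsTxt",  "type": ["null", "string"]},
    {"name": "sitemaps",   "type": {"type": "array", "items": "string"}}
  ]
}
```

The metadata map would cover stats and per-host settings to propagate to webpages; robotsTxt and sitemaps correspond to the last two bullets above.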

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


