Re: Nutch 2.0

2010-06-28 Thread Andrzej Bialecki
On 2010-06-28 07:49, Sami Siren wrote:
 One aspect that has not been discussed yet is the legal aspect.
 According to
 http://incubator.apache.org/ip-clearance/index.html there is a formal
 process for integrating external development efforts that have
 happened outside of Apache. Should we be following the ip clearance
 process in this case too?

The concept of a substantial contribution that should be subject to a
software grant is somewhat tenuous, though. Keep in mind that you do
something equivalent in JIRA already - when you check the "Grant license
to ASF" box you perform a micro-grant. So the question is whether we
should go through a full grant or through the JIRA micro-grant.

In my opinion it's ok to do the latter, since much of the code is simply
a modified version of Nutch classes - not counting GORA, of course, but
that part will be added as a third-party lib. So IMHO it's enough to zip
all source (without libs), attach it to a JIRA issue and mark the
checkbox. Then we follow the process outlined by Chris, which imports
the same codebase into our svn. What do you think?

If folks agree that this is sufficient, then Dogacan & Enis - can you
please create a separate JIRA issue, prepare a patch like this, mark the
checkbox, and list all dependencies and their licenses for those that
are not already in Nutch svn?

-- 
Best regards,
Andrzej Bialecki 
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: Where is nutch 2.0

2010-06-29 Thread Andrzej Bialecki
On 2010-06-29 11:17, Raghavendra Neelekani wrote:
 Hi
 
 Can you please tell me where I can download Nutch 2.0?

Nutch 2.0 is in the planning and early development phase, so it can't be
downloaded yet. We hope to produce a working Nutch 2.0 some time in Q4 2010.

-- 
Best regards,
Andrzej Bialecki 
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



[Nutchbase] WebPage class is a generated code?

2010-07-02 Thread Andrzej Bialecki

Hi,

(This question is mostly to Dogacan & Enis, but I encourage anyone 
familiar with the code to join the threads with [Nutchbase] - the sooner 
the better ;) ).


I'm looking at src/gora/webpage.avsc and WebPage.java & friends... 
presumably the java code was autogenerated from avsc using Gora? If so, 
we should put this autogeneration step in our build.xml. Or am I missing 
something?
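
For illustration, such a target could look roughly like the sketch below. 
This is only a sketch - the compiler class name, the classpath reference 
and the output path are assumptions that would need to be checked against 
the Gora code we actually ship:

<!-- hypothetical target: regenerate WebPage.java & friends from the Avro schema -->
<target name="generate-gora-sources">
  <java classname="org.gora.compiler.GoraCompiler" classpathref="classpath"
        fork="true" failonerror="true">
    <arg value="src/gora/webpage.avsc"/> <!-- input Avro schema -->
    <arg value="src/java"/>              <!-- destination dir for generated sources -->
  </java>
</target>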


--
Best regards,
Andrzej Bialecki 
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: Minimizing the number of stored fields for Solr

2010-07-03 Thread Andrzej Bialecki

On 2010-07-03 10:00, Doğacan Güney wrote:

Hey everyone,

This is not really a proposition but rather something I have been wondering
for a while so I wanted to see what everyone is
thinking.

Currently in our solr backend, we have stored=true indexed=false fields
and stored=true indexed=true fields. The former
class of fields are mostly used for storing digest, caching information etc.
I suggest that we get rid of all indexed=false fields and
read all such data from storage backend.

For the latter class of fields (i.e., stored=true indexed=true), I suggest
that we set them to stored=false for everything but id field. As an
example currently title is stored/indexed in solr while text is only indexed
(thus, will need to be fetched from storage backend). But for hbase
backend, title and text are already stored close together (in the same
column family) so performance hit of reading just text or reading both
will likely be the same. And removing storage from solr may lead to better
caching of indexed fields and thus to better performance.

What does everyone think?



The issue is not as simple as it looks. If you want to have good 
performance for searching & snippet generation then you still need to 
store some data in stored fields - at least url, title, and plain text 
(not to mention the option to use term vectors in order to speed up the 
snippet generation). Solr functionality can be also impaired by a lack 
of data available directly from Lucene storage (field cache, faceting, 
term vector highlighting).


Some fields of course are not useful for display, but are used for 
searching only (e.g. anchors). These should be indexed but not stored in 
Solr. And it's ok to get them from non-solr storage if requested, 
because it's a rare event. The same goes for the full raw content, if 
you want to offer a cached view - this should not be stored in Solr 
but instead it should come from a separate layer (note that sometimes 
cached view might not be in the original format - pdf, office, etc - and 
instead an html representation may be more suitable, so in general the 
cached view shouldn't automatically equal the original raw content).
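
For illustration, the distinction in schema.xml terms could look like the 
sketch below (field names and types here are just examples, not a proposal 
for the actual schema):

<!-- illustrative schema.xml fragment only -->
<field name="id"      type="string" indexed="true"  stored="true"/>
<field name="title"   type="text"   indexed="true"  stored="true"/>  <!-- needed for display & snippets -->
<field name="content" type="text"   indexed="true"  stored="true"/>  <!-- keep stored for now, see below -->
<field name="anchor"  type="text"   indexed="true"  stored="false" multiValued="true"/>  <!-- search-only -->
<field name="digest"  type="string" indexed="false" stored="true"/>  <!-- candidate to serve from the storage backend -->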


But for other fields I would argue that for now they should remain 
stored in Solr, *even the full text*, until we figure out how they 
affect the ability and performance of common search operations. E.g. if 
we remove the stored title field then we need to reach into the storage 
layer in order to display each page of results... not to mention issues 
like highlighting, faceting, function queries and a host of other 
functionalities that Solr can offer just because a field is stored in 
its index.


So I'm -0 to this proposal - of course we should review our schema, and 
of course we should have a mechanism to get data from the storage layer, 
but what you propose is IMHO a premature optimization at this point.


--
Best regards,
Andrzej Bialecki 
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



YCSB benchmark for KV stores

2010-07-03 Thread Andrzej Bialecki

Hi,

Found this link:

http://wiki.github.com/brianfrankcooper/YCSB/papers-and-presentations

Would be cool to run the benchmark for the same stores but via Gora.

--
Best regards,
Andrzej Bialecki 
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: Parse-tika ignores too much data...

2010-07-08 Thread Andrzej Bialecki

On 2010-07-07 22:32, Ken Krugler wrote:

Hi Julien,


See https://issues.apache.org/jira/browse/TIKA-457 for a description
of one of the cases found by Andrzej. There seems to be something very
wrong with the way body is handled; we also saw cases where it appeared
twice in the output.


Don't know about the case of it appearing twice.

But for the above issue, I added a comment. The test HTML is badly
broken, in that you can either have a body OR a frameset, but not both.


The HTML was broken on purpose - one of the goals of the original test 
was to extract as much content and as many links as possible in the 
presence of grave errors - as you know, even major sites often produce 
badly broken HTML, but the parser should sanitize it and produce a valid 
DOM. In this case, it produced two nested body elements, which is not 
valid. I should also mention that NekoHTML handled this test much better, 
by removing the body and retaining only the frameset.


--
Best regards,
Andrzej Bialecki 
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: Merging in nutchbase

2010-07-10 Thread Andrzej Bialecki

On 2010-07-10 15:00, Doğacan Güney wrote:

Hey everyone,

I would like to start merging in nutchbase to trunk, so I am hoping to get
everyone's comments and suggestions on
how to do that.



Do we have any way to run the merged code without running HBase? I think 
that the SQL backend to Gora needs to be tested first with the nutchbase 
branch - otherwise the development and testing will become very 
difficult... So in my opinion we need to make sure we can use a small 
SQL backend (Derby or HSQL) before we start merging.


As for the mechanics of the patching - yes, I think it needs to be done 
this way.


--
Best regards,
Andrzej Bialecki 
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: Merging in nutchbase

2010-07-10 Thread Andrzej Bialecki

On 2010-07-10 17:01, Doğacan Güney wrote:

Hey everyone,

On Sat, Jul 10, 2010 at 17:43, Mattmann, Chris A (388J)
chris.a.mattm...@jpl.nasa.gov  wrote:


  Hey Guys,

+1 to Andrzej’s suggestion. I mostly run small scale stuff with Nutch, so
unless I can run HBase in small scale (or better yet, an embedded SQL db), I
won’t be as much use! :)



I just want to make clear that this is, indeed, a goal I share. Gora already
has an SQL backend that can use embedded hsqldb. However, there are some
weird bugs (I really hate SQL :), but once I am done fixing all bugs (which
I will be doing today and tomorrow), nutch will run on gora - (embedded
hsqldb) with zero configuration.


Excellent, that would be a real breakthrough.


--
Best regards,
Andrzej Bialecki 
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: Component fetching during parsing. (vertical crawling)

2010-07-20 Thread Andrzej Bialecki
On 2010-07-20 14:30, Ferdy wrote:
 Hello,
 
 We are currently using a heavily modified version of nutch. The main
 reason for this is the fact that we do not only fetch the urls that the
 QueueFeeder submits, but also additional resources from urls that are
 constructed during parsing. So for example let's say the QueueFeeder
 submits a html page to the fetcher, and after the fetch the page gets
 parsed. Nothing special so far. However the parser decides it also needs
 some images on the page. Perhaps these images link to other html pages,
 and we might want to fetch these too. All this is needed to parse
 information about this particular url we started with. These extra fetch
 urls we like to call Components, because they are additional resources
 required to do the parsing of our initial html page that was selected
 for fetching.
 
 At first we tried to solve this vertical crawling problem by using
 multiple crawl cycles. Each crawl simply selects outlinks that are
 needed for the parsing of the initial html page. A single inspection can
 possibly overlap 2, 3 or 4 cycles (depending on the inspection's graph
 depth). There are several problems with this approach, for one that the
 crawldb is cluttered with all these component urls and secondly that
 inspection completion times can be very long.
 
 As an alternative we decided to let the parser fetch needed components
 on-the-fly, so that additional urls are instantly added to the fetcher
 lists. Every fetched url can be either a non-component (the QueueFeeder
 fed it; start parsing this resource) or a component (the fetcher
 hands the resource over to the parser that requested it). In order to
 keep parsers alive we always try to fetch components first, while
 respecting fetch politeness. A downside of this solution is that your fetch task's
 total running time will be more difficult to anticipate. For example,
 if you inject and generate 100 urls and they will be fetched in a single
 task, you might end up fetching a total of 1100 urls (on the assumption that
 each inspection needs 10 components). We found this behaviour to be
 acceptable.
 
 Because of our custom version of nutch we cannot upgrade easily to newer
 versions (we're still using modified fetcher classes from nutch 0.9).
 Often we end up fixing bugs that have already been fixed by the
 community. Also, other users might benefit from our changes too.
 
 Therefore we propose to redesign our vertical crawling system from
 scratch for the newer nutch versions, should there be any interest from
 the community. Perhaps we are not the only one to implement such a
 system with nutch. So, what are your thoughts about this?

If I understand your use case properly, this is really a custom Fetcher
that you are talking about - a strategy to fetch complete pages
(together with the resources needed to display the page)
should be possible to implement in a custom fetcher without changing
other Nutch areas.


-- 
Best regards,
Andrzej Bialecki 
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: Nutchbase merge strategy

2010-07-21 Thread Andrzej Bialecki

On 2010-07-21 21:12, Mattmann, Chris A (388J) wrote:

Hey Andrzej,


+1 to all of the above - see below.



So if 1-4 make sense, let's do 1, 2 and 3 today or tomorrow -- 4 can happen
over the next few weeks. WDYT?


This is a serious move - let's wait a bit, say until Monday, to give
chance to others to comment.


Agreed. Let's wait until Monday. If there aren't any objections, let's let
er' rip!

BTW, #4 is independent of #1-3. WDYT about wrapping up the 1.x series of
Nutch and rolling a 1.2 in the next few days (while I have some free
cycles)? :) #4 is also in its own branch and therefore independent as well
so it won't be as brave a move.

Let me know what you (all) think.


If 1.2 is going to be the last release in 1.x series then I think we 
should review some pending issues, especially those reported after 1.0 
release:


https://issues.apache.org/jira/secure/IssueNavigator.jspa?reset=true&pid=10680&updated%3Aprevious=-1w&created%3Aafter=1%2FApr%2F09&status=1&status=3&status=4&sorter/field=updated&sorter/order=DESC

Actually, just two issues are still unresolved... hmm, not bad.

--
Best regards,
Andrzej Bialecki 
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



[Nutchbase] Multi-value ParseResult missing

2010-07-21 Thread Andrzej Bialecki

Hi,

I noticed that nutchbase doesn't use the multi-valued ParseResult, 
instead all parse plugins return a simple Parse. As a consequence, it's 
not possible to return multiple values from parsing a single WebPage, 
something that parsers for compound documents absolutely require 
(archives, rss, mbox, etc). Dogacan - was there a particular reason for 
this change?


However, a broader issue here is how to treat compound documents, and 
links to/from them:
 a) record all URLs of child documents (e.g. with the !/ notation, or # 
notation), and create as many WebPage-s as there were archive members. 
This needs some hacks to prevent such urls from being scheduled for 
fetching.
 b) extend WebPage to allow for multiple content sections and their 
names (and metadata, and ... yuck)
 c) like a) except put a special synthetic mark on the page to 
prevent selection of this page for generation and fetching. This mark 
would also help us to update / remove obsolete sub-documents when their 
container changes.

I'm leaning towards c).

Now, when it comes to the ParseResult ... it's not an ideal solution 
either, because it means we have to keep all sub-document results in 
memory. We could avoid it by implementing something that Aperture uses, 
which is a "sub-crawler" - a concept of a parser plugin for compound 
formats. The main plugin would return a special result code, which 
basically says "this is a compound format of type X", and then the 
caller (ParseUtil?) would use SubCrawlerFactory.create(typeX, 
containerDataStream) to create a parser for the container. This parser 
in turn would simply extract sections of the compound document (as 
streams) and it would pass each stream to the regular parsing chain. The 
caller then needs to iterate over results returned from the SubCrawler. 
What do you think?


--
Best regards,
Andrzej Bialecki 
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Benchmark of Nutch trunk

2010-07-30 Thread Andrzej Bialecki

Hi,

We have a simple crawling benchmark now in trunk. Here's how to use it:

* in one console execute 'ant proxy'. This will start a proxy server on 
port 8181 that produces fake pages.


* in another console execute 'ant benchmark'. This will run 5 rounds of 
fetching (~16,000 pages) using that proxy server.


There are already some interesting issues I noticed. First, on 
reasonably good hardware in local mode I was able to fetch and process 
(NOTE: this includes ALL steps, i.e. generate, fetch, parse, crawldb 
update and invertlinks) 16k pages in 400 sec. This means a total 
crawling throughput of 40 pages/sec. This is in local mode, so in 
distributed mode I guess we would be getting this number times the 
number of tasks.


Secondly, it seems that Fetcher has some synchronization issues in its 
queue management - even if other queues are non-empty, but one of the 
queues blocks, the Fetcher will spin-wait all threads until an item 
becomes available on that queue, and then it starts to happily consume 
items from all non-blocking queues (including this one). The process 
then repeats - one queue blocks, and all threads stop getting items from 
other queues... At the moment I can't figure out where this lock-up is 
happening, but the symptoms are obvious when you look at the logs in 
real-time.


More stuff to come on this subject - at least we have a tool to 
experiment with :)


--
Best regards,
Andrzej Bialecki 
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: Seeking Insight into Nutch Configurations

2010-08-02 Thread Andrzej Bialecki

On 2010-08-02 10:17, Scott Gonyea wrote:

The big problem that I am facing, thus far, occurs on the 4th fetch.
All but 1 or 2 maps complete. All of the running reduces stall (0.00
MB/s), presumably because they are waiting on that map to finish? I
really don't know and it's frustrating.


Yes, all map tasks need to finish before reduce tasks are able to 
proceed. The reason is that each reduce task receives a portion of the 
keyspace (and values) according to the Partitioner, and in order to 
prepare a nice key, list(value) in your reducer it needs to, well, get 
all the values under this key first, whichever map task produced the 
tuples, and then sort them.


The failing tasks probably fail due to some other factor, and very 
likely (based on my experience) the failure is related to some 
particular URLs. E.g. regex URL filtering can choke on some pathological 
URLs, like URLs 20kB long, or containing '\0' etc, etc. In my 
experience, it's best to keep regex filtering to a minimum if you can, 
and use other urlfilters (prefix, domain, suffix, custom) to limit your 
crawling frontier. There are simply too many ways where a regex engine 
can lock up.


Please check the logs of the failing tasks. If you see that a task is 
stalled you could also log in to the node, and generate a thread dump a 
few times in a row (kill -SIGQUIT pid) - if each thread dump shows the 
regex processing then it's likely this is your problem.



My scenario:
# Sites: 10,000-30,000 per crawl
Depth: ~5
Content: Text is all that I care for. (HTML/RSS/XML)
Nodes: Amazon EC2 (ugh)
Storage: I've performed crawls with HDFS and with Amazon S3. I thought S3 would be more performant, yet it doesn't appear to affect matters.
Cost vs Speed: I don't mind throwing EC2 instances at this to get it done quickly... But I can't imagine I need much more than 10-20 mid-size instances for this.


That's correct - with this number of unique sites the max. throughput of 
your crawl will be ultimately limited by the politeness limits (# of 
requests/site/sec).




Can anyone share their own experiences in the performance they've
seen?


There is a very simple benchmark in trunk/ that you could use to measure 
the raw performance (data processing throughput) of your EC2 cluster. 
The real-life performance, though, will depend on many other factors, 
such as the number of unique sites, their individual speed, and (rarely) 
the total bandwidth at your end.



--
Best regards,
Andrzej Bialecki 
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: Seeking Insight into Nutch Configurations

2010-08-02 Thread Andrzej Bialecki

On 2010-08-02 22:59, Scott Gonyea wrote:

By the way, can anyone tell me if there is a way to explicitly limit how
many pages should be fetched, per fetcher-task?


I believe that in general case it would be a very complex problem to 
solve so that you get exact results. The reason is that Nutch doesn't 
use any global lock manager, so the only way to ensure a proper per-host 
locking is to assign all URL-s from any given host to the same map task. 
This may (and often will) create an imbalance in the number of allocated 
URL-s per task.


One method to mitigate this imbalance is to set generate.max.count (in 
trunk, generate.max.per.host in 1.1) - this will limit the number of 
URL-s from any given host to X, thus helping to mix these N per-host 
chunks more evenly across M maps.
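
For example, in nutch-site.xml (the value 100 below is just an 
illustration - tune it to your crawl):

<property>
  <name>generate.max.count</name>
  <value>100</value>
  <description>Upper limit on the number of URLs from a single host
  included in one fetchlist.</description>
</property>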



I think part of the problem is that, seemingly, Nutch seems to be
generating some really unbalanced fetcher tasks.

The task (task_201008021617_0026_m_00) had 6859 pages to fetch.
  Each higher-numbered task had fewer pages to fetch.  Task 000180 only
had 44 pages to fetch.


There's no specific tool to examine the composition of fetchlist 
parts... try running this in the segments/2010*/crawl_generate/:


for i in part-00*
do
  echo "--- part $i ---"
  strings $i | grep "http://"
done

to print URL-s per map task. Most likely you will see that there was no 
other way to allocate the URLs per task to satisfy the constraint that I 
explained above. If it's not the case, then it's a bug. :)




This *huge* imbalance, I think, creates tasks that are seemingly
unpredictable.  All of my other resources just sit around, wasting
resources, until one task grabs some crazy number of sites.


Again, generate.max.count is your friend - even though you won't be able 
to get all pages from a big site in one go, at least your crawls will 
finish quickly and you will quickly progress breadth-wise, if not 
depth-wise.


--
Best regards,
Andrzej Bialecki 
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



[Nutchbase] jmxtools issue...

2010-08-04 Thread Andrzej Bialecki

Hi,

I can't compile nutchbase at the moment - ivy has trouble finding 
jmxri.jar and jmxtools.jar ... I found jmxri.jar somewhere and put it in 
my .ivy2/local, but I can't find jmxtools.jar ... Anyway, why do we need 
these two jars at all???


--
Best regards,
Andrzej Bialecki 
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Hsqldb 2.0 conflicts with Hsqldb 1.8 in Hadoop

2010-08-10 Thread Andrzej Bialecki

Hi,

I was trying to run Benchmark in trunk using MySQL, on a standalone 
Hadoop cluster. My conf/gora.properties has this:


gora.sqlstore.jdbc.driver=com.mysql.jdbc.Driver
gora.sqlstore.jdbc.url=jdbc:mysql://localhost:3306/nutch?user=nutch&password=nutch

Jobs were failing though, with the following:

Exception in thread main java.lang.NoSuchMethodError: 
org.hsqldb.DatabaseURL.parseURL(Ljava/lang/String;ZZ)Lorg/hsqldb/persist/HsqlProperties;

at org.hsqldb.jdbc.JDBCDriver.getConnection(Unknown Source)
at org.hsqldb.jdbc.JDBCDriver.connect(Unknown Source)
at java.sql.DriverManager.getConnection(DriverManager.java:582)
at java.sql.DriverManager.getConnection(DriverManager.java:207)
at org.gora.sql.store.SqlStore.getConnection(SqlStore.java:712)
at org.gora.sql.store.SqlStore.initialize(SqlStore.java:145)
at 
org.gora.store.DataStoreFactory.initializeDataStore(DataStoreFactory.java:64)
at 
org.gora.store.DataStoreFactory.createDataStore(DataStoreFactory.java:86)
at 
org.gora.store.DataStoreFactory.createDataStore(DataStoreFactory.java:98)
at 
org.gora.store.DataStoreFactory.createDataStore(DataStoreFactory.java:70)
at 
org.apache.nutch.storage.StorageUtils.createDataStore(StorageUtils.java:25)
at 
org.apache.nutch.storage.StorageUtils.initMapperJob(StorageUtils.java:68)
at 
org.apache.nutch.storage.StorageUtils.initMapperJob(StorageUtils.java:50)

at org.apache.nutch.crawl.InjectorJob.inject(InjectorJob.java:237)
at org.apache.nutch.tools.Benchmark.benchmark(Benchmark.java:190)
at org.apache.nutch.tools.Benchmark.run(Benchmark.java:139)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
at org.apache.nutch.tools.Benchmark.main(Benchmark.java:32)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)

at java.lang.reflect.Method.invoke(Method.java:597)
at org.apache.hadoop.util.RunJar.main(RunJar.java:156)


Isn't this puzzling... It turns out that java.sql.DriverManager will try 
_all_ drivers in turn to see which one can handle the jdbcUrl, and the 
usual magic of Class.forName(jdbcDriver) doesn't mean we are going to 
use jdbcDriver, it's just to make sure the driver class was loaded and 
registered itself on the list of available drivers.


Now, I know why this particular error occurred - Hadoop includes HSQLDB 
1.8, and we use HSQLDB 2.0. When DriverManager tries each driver in 
turn, unfortunately Hsqldb is first on the classpath (it ships in 
Hadoop's lib/), and MySQL is the last, so it bombs out even before trying 
the right driver...


For now I changed my build.xml to this:

Index: build.xml
===================================================================
--- build.xml   (revision 983564)
+++ build.xml   (working copy)
@@ -123,7 +123,7 @@
               excludes="nutch-default.xml,nutch-site.xml"/>
       <zipfileset dir="${conf.dir}" excludes="*.template,hadoop*.*"/>
       <zipfileset dir="${build.lib.dir}" prefix="lib"
-          includes="**/*.jar" excludes="hadoop-*.jar"/>
+          includes="**/*.jar" excludes="hadoop-*.jar,hsqldb*.jar"/>
       <zipfileset dir="${build.plugins}" prefix="plugins"/>
     </jar>
   </target>



--
Best regards,
Andrzej Bialecki 
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: Tika HTML parsing

2010-08-15 Thread Andrzej Bialecki

On 2010-08-15 06:54, Ken Krugler wrote:

For what it's worth, I just committed some patches to Tika that should
improve Tika's ability to extract HTML outlinks (in img and frame
elements, at least). Support for iframe should be coming soon :)

This is in 0.8-SNAPSHOT, and there's one troubling parse issue I'm
tracking down, but I think Tika is getting closer to being usable by
Nutch for typical web crawling.


Thanks Ken for pushing forward this work! A few questions:

* does this include image maps as well (area)?

* how does the code treat invalid html with both body and frameset?

* what's the status of extracting the meta robots and link rel information?

--
Best regards,
Andrzej Bialecki 
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: Alternative search box for Nutch site

2010-08-30 Thread Andrzej Bialecki

On 2010-08-30 12:21, Otis Gospodnetic wrote:

Hello peeps,

We've created a patch for Tika and got some good and constructive feedback (see
https://issues.apache.org/jira/browse/TIKA-488 ).

Should we follow the same functionality pattern for nutch.apache.org as seen in
TIKA-488?


Sure, why not - when preparing the patch let's follow the same 
rationales as those in TIKA-488, since they are applicable here too.



--
Best regards,
Andrzej Bialecki 
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: nutch 2.0 (trunk)

2010-09-07 Thread Andrzej Bialecki

On 2010-09-07 14:50, Faruk Berksöz wrote:

Dear all,

When I try to fetch a web page (e.g.
http://www.w3.org/Protocols/rfc2616/rfc2616-sec14.html ) with the mysql
storage definition,
I am seeing the following error in my hadoop logs (no error with
hbase):

java.io.IOException: java.sql.BatchUpdateException: Data truncation:
Data too long for column 'content' at row 1
 at org.gora.sql.store.SqlStore.flush(SqlStore.java:316)
 at org.gora.sql.store.SqlStore.close(SqlStore.java:163)
 at
org.gora.mapreduce.GoraOutputFormat$1.close(GoraOutputFormat.java:72)
 at
org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:567)
 at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:408)
 at
org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:216)

The type of the column 'content' is BLOB.
It may be important for the next developments of Gora.
Should I file this in the Nutch JIRA, or on github/gora, or not at all?

environments : ubuntu 10.04
JVM : 1.6.0_20
nutch 2.0 (trunk)
Mysql/HBase (0.20.6) / Hadoop(0.20.2) pseudo-distributed


Yes, please create a JIRA issue. Thanks!



--
Best regards,
Andrzej Bialecki 
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: [VOTE] Apache Nutch 1.2 Release Candidate #1

2010-09-10 Thread Andrzej Bialecki

On 2010-08-09 16:45, Julien Nioche wrote:

I reopened https://issues.apache.org/jira/browse/NUTCH-870. It would be
good to fix it before releasing 1.2



This is fixed. How about doing the release now?


--
Best regards,
Andrzej Bialecki 
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: [VOTE] Apache Nutch 1.2 Release Candidate #4

2010-09-24 Thread Andrzej Bialecki

On 2010-09-24 04:38, Mattmann, Chris A (388J) wrote:

Hi Nutch PMC:

/nudge

Anyone get a chance to review this yet? I have some free cycles tomorrow
and would really think it’s cool if I could finally push out the 1.2 RC.


I had little time this week, but I'm testing it now... I should be done 
tomorrow.



--
Best regards,
Andrzej Bialecki 
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: [VOTE] Apache Nutch 1.2 Release Candidate #4

2010-09-24 Thread Andrzej Bialecki

On 2010-09-24 20:40, Mattmann, Chris A (388J) wrote:

Thanks Andrzej, appreciate it. I know you’ve been really vigilant with
the other RCs I’ve thrown up about testing and I appreciate it. Other
Nutch PMC’ers: just need one more VOTE. Help, please? :)


+1, all unit tests pass, and a test crawl + indexing to Solr went just fine.


--
Best regards,
Andrzej Bialecki 
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: Build failed in Hudson: Nutch-trunk #1280

2010-10-19 Thread Andrzej Bialecki
On 2010-10-19 06:01, Apache Hudson Server wrote:

 [Nutch-trunk] $ /bin/bash -xe /tmp/hudson7277994413075810777.sh
 + 
 PATH=/home/hudson/tools/java/latest1.6/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/ucb:/usr/local/bin:/usr/bin:/usr/sfw/bin:/usr/sfw/sbin:/opt/sfw/bin:/opt/sfw/sbin:/opt/SUNWspro/bin:/usr/X/bin:/usr/ucb:/usr/sbin:/usr/ccs/bin
 + export ANT_HOME=/export/home/hudson/tools/ant/latest
 + ANT_HOME=/export/home/hudson/tools/ant/latest
 + export PATH ANT_HOME
 + cd trunk
 + /export/home/hudson/tools/ant/latest/bin/ant -Dversion=2010-10-19_04-00-41 
 -Dtest.junit.output.format=xml nightly
 /tmp/hudson7277994413075810777.sh: line 7: 
 /export/home/hudson/tools/ant/latest/bin/ant: No such file or directory

Do you know guys why the automated builds are failing? Looks like Ant is
not where the build script expects it to be...


-- 
Best regards,
Andrzej Bialecki 
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: ReviewBoard Instance

2010-10-26 Thread Andrzej Bialecki
On 2010-10-26 15:53, Mattmann, Chris A (388J) wrote:
 Hi Guys,
 
 Gav from infra@ set up a ReviewBoard instance for Apache [1]. I've never
 used it before but I thought I'd request an account on it for Nutch [2]
 regardless, so if folks want to use it, they can.

Hmm, I may be missing something... but what's the point of using the
tool in our JIRA-based workflow? It looks to me like it duplicates at
least part of JIRA's functionality, and the remaining part is what we do
also in JIRA by convention...


-- 
Best regards,
Andrzej Bialecki 
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: Java.io.IOException with multiple <copyField/> directives

2010-12-03 Thread Andrzej Bialecki
On 2010-12-03 09:52, Peter Litsegård wrote:
 Hi!
 
 I've run into a strange behaviour while using Nutch (solrindexer) together 
 with Solr 1.4.1. I'd like to copy the 'title' and 'content' field to another 
 field, say, 'foo'. In my first attempt I added the <copyField/> directives in 
 schema.xml and got the java exception, so I removed them from schema.xml. In 
 my second attempt I added the <copyField/> directives to the 
 'solrindex-mapping.xml' file and ran into the same exception again! Is this a 
 known issue or have I stumbled into unknown territory?
 
 Any workarounds?

I suspect that the field type declared in your schema.xml is not
multiValued. What was the exception?
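
For reference, a copy target that receives several sources usually has to 
be declared multiValued, along these lines (illustrative only - adjust the 
field type to your schema):

<field name="foo" type="text" indexed="true" stored="true" multiValued="true"/>
<copyField source="title"   dest="foo"/>
<copyField source="content" dest="foo"/>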


-- 
Best regards,
Andrzej Bialecki 
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: Does Nutch 2.0 in good enough shape to test?

2010-12-17 Thread Andrzej Bialecki

(switching to devs)

On 12/17/10 10:18 AM, Alexis wrote:

Hi,

I've spent some time working on this as well. I've just put together a
blog entry addressing the issues I ran into. See
http://techvineyard.blogspot.com/2010/12/build-nutch-20.html

In a nutchsell, I changed three pieces in Gora and Nutch code:
- flush the datastore regularly in the Hadoop RecordWriter (in GoraOutputFormat)


Careful here. DataStore flush may be very expensive, so it should be 
done only when we are finished with the output. If you see that data is 
lost without this flush then this should be reported as a Gora bug.



- wait for Hadoop job completion in the Fetcher job


I missed your previous email... I'll fix this shortly - thanks for 
spotting it.


--
Best regards,
Andrzej Bialecki 
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Gora/HBase dependencies and deploy artifacts

2011-02-24 Thread Andrzej Bialecki

Hi all,

Recently I've been deploying Nutch trunk to an already existing Hadoop 
cluster. And immediately I hit a snag.


Nutch was configured to use gora-hbase. The nutch.job jar doesn't 
include gora-hbase even if it was configured in nutch-site.xml. 
Furthermore, gora-hbase depends on HBase and its dependencies, which 
need to be found on classpath.


Typically for development and testing I solved this issue by deploying 
gora-core and gora-hbase + all hbase libs to hadoop/lib across the 
cluster. This is a bit dirty - Hadoop clusters should be seen as a 
generic computing fabric, so they should be application-agnostic; 
besides, this creates maintenance & ops issues.


We could put all these libs in lib/ inside nutch.job, so that they are 
unpacked and put on classpath during task setup. This would work fine 
for Mapper/Reducer. HOWEVER... I saw in some versions of Hadoop that 
InputFormat / OutputFormat classes were initialized prior to this 
unpacking - and in our case these depend on the libs in as-yet-unpacked 
job jar... e.g. GoraInputFormat. (I'm not 100% sure that's the case in 
Hadoop 0.20.2, so this is something that needs to be tested).
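
In build.xml terms that would mean adding something like the sketch below 
to the job jar target (gora.libs.dir is a made-up property here, just to 
illustrate the idea):

<!-- sketch only: bundle the Gora backend and its HBase dependencies under lib/ in nutch.job -->
<zipfileset dir="${gora.libs.dir}" prefix="lib"
            includes="gora-core-*.jar,gora-hbase-*.jar,hbase-*.jar,zookeeper-*.jar"/>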


Furthermore, even if we packed the jars in lib/ inside nutch.job, still 
many tools wouldn't work, because they depend on classes from those libs 
during the local execution (before the job is sent to task trackers), 
and the URLClassLoader can't load classes from jars within jars... A 
workaround for this would be to take all those jars and re-pack them 
together under / directory in nutch.job. This would satisfy the 
dependencies for local execution, and for Mapper/Reducer execution but 
I'm not sure if it solves the problem of Input/OutputFormat-s that I 
mentioned above.


In short, we need a clear working procedure how to deploy Gora backend 
implementations so that they work with Nutch and with a generic 
unmodified Hadoop cluster.


--
Best regards,
Andrzej Bialecki 
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: [jira] Closed: (NUTCH-951) Backport changes from 2.0 into 1.3

2011-03-10 Thread Andrzej Bialecki

On 3/10/11 10:57 PM, Julien Nioche (JIRA) wrote:


  [ 
https://issues.apache.org/jira/browse/NUTCH-951?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Nioche closed NUTCH-951.
---


NUTCH-825 committed in revision 1080368
All the known improvements from 2.0 have been backported into 1.3 now



The only remaining issue to address before rolling out a 1.3 release is 
NUTCH-914 Implement Apache Project Branding Requirements (and subtasks...)



--
Best regards,
Andrzej Bialecki 
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: Differences 1.x and trunk

2011-03-18 Thread Andrzej Bialecki

On 3/18/11 4:31 PM, Markus Jelsma wrote:

Hi all,

I'm giving it a try to patch https://issues.apache.org/jira/browse/NUTCH-963
to trunk after committing to 1.3. There are of course a lot of differences so
I need a little advice on how to proceed:

- instead of using CrawlDB and CrawlDatum we now need WebTableReader?


Actually you need to use StorageUtils to set up Mapper or Reducer 
contexts. See other tools, e.g. Fetcher or Generator.



- trunk uses slf instead of commons logging now?


Yes.


- a page is now represented by storage.WebPage?


Yes. When you prepare a Job you also need to specify what fields from 
WebPage you are interested in (and only these fields will be pulled in 
from the storage). This is all handled by StorageUtils methods.


--
Best regards,
Andrzej Bialecki 
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: [VOTE] Move 2.0 out of trunk

2011-09-20 Thread Andrzej Bialecki

On 18/09/2011 02:21, Julien Nioche wrote:

Hi,

Following the discussions [1] on the dev-list about the future of Nutch
2.0, I would like to call for a vote on moving Nutch 2.0 from the trunk
to a separate branch, promote 1.4 to trunk and consider 2.0 as
unmaintained. The arguments for / against can be found in the thread I
mentioned.

The vote is open for the next 72 hours.

[ ] +1 : Shelve 2.0 and move 1.4 to trunk
[] 0 : No opinion
[] -1 : Bad idea.  Please give justification.


+1 - at this time it's clear that 2.0 didn't pan out as we expected, and 
we should restart from the 1.x codebase as a usable platform, and continue 
the redesign from there.


--
Best regards,
Andrzej Bialecki 
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: [jira] [Commented] (NUTCH-797) parse-tika is not properly constructing URLs when the target begins with a ?

2011-10-12 Thread Andrzej Bialecki

On 12/10/2011 13:17, Markus Jelsma (Commented) (JIRA) wrote:


 [ 
https://issues.apache.org/jira/browse/NUTCH-797?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13125717#comment-13125717
 ]

Markus Jelsma commented on NUTCH-797:
-

This test was on a local instance. I tried both values for 
parser.fix.embeddedparams with:
$ bin/nutch parsechecker http://www.funkybabes.nl/;ROOOWAN/fotoboek


Is this how it should be implemented? I'm not sure. Embedded params are a bit 
puzzling :)


Hmm ... if that's the exact command-line expression that you entered 
then if you are using a *nix shell the semicolon would mean the end of 
command, so in fact what was executed would be:


$ bin/nutch parsechecker http://www.funkybabes.nl/
...lots of output ...
bash: ROOOWAN/fotoboek: command not found


--
Best regards,
Andrzej Bialecki 
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: Nutch Maven artifacts now published as polled/nightly SNAPSHOTS

2011-11-05 Thread Andrzej Bialecki

On 05/11/2011 06:44, Mattmann, Chris A (388J) wrote:

Hey Guys,

I modified the Jenkins jobs that Lewis set up to now:

* poll SCM hourly for changes to Nutch
* publish Maven snapshots (1.5-SNAPSHOT) and above of Nutch
to repository.apache.org


Very useful - thanks a lot!

--
Best regards,
Andrzej Bialecki 
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: Persistent problems with Ivy dependencies in Eclipse

2011-11-10 Thread Andrzej Bialecki

On 10/11/2011 04:39, Lewis John Mcgibbney wrote:

Gets even more strange, both SWFParser and AutomationURLFilter import
additional dependencies, however they are not included within their
plugin/ivy/ivy.xml files!

Am I missing something here?


Most likely these problems come from the initial porting of a pure ant 
build to an ant+ivy build. We should determine what deps are really 
needed by these plugins, and sanitize the ivy.xml files so that they 
make sense - if the existing files can't be untangled we can ditch them 
and come up with new, clean ones.
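
As a target state, a cleaned-up plugin ivy.xml would probably contain 
little beyond the plugin's real external dependencies - a rough sketch 
(organisation, module and dependency values are placeholders):

<ivy-module version="1.0">
  <info organisation="org.apache.nutch" module="${ant.project.name}"/>
  <dependencies>
    <!-- only the libraries this plugin actually imports -->
    <dependency org="some.org" name="some-lib" rev="1.0" conf="*->default"/>
  </dependencies>
</ivy-module>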


--
Best regards,
Andrzej Bialecki 
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: Signature == null ?

2011-11-15 Thread Andrzej Bialecki

On 15/11/2011 20:33, Markus Jelsma wrote:

It's back again! Last try if someone has a pointer for this.
Cheers


After some DB updates, they're gone! Anyone recognizes this phenomenon?

On Tuesday 08 November 2011 11:22:48 Markus Jelsma wrote:

On Tuesday 08 November 2011 11:15:37 Markus Jelsma wrote:

Hi guys,

I've a M/R job selecting only DB_FETCHED and DB_NOTMODIFIED records and
their signatures. I had to add a sanity check on the signature to avoid an
NPE. I had assumed any record with such a DB_ status has to have a
signature, right?

Why does roughly 0.0001625% of my records exist without a signature?


Now with correct metrics:
Why does roughly 0.84% of my records exist without a signature?


This could be somehow related to pages that come from redirects so that 
when they are fetched they are accounted for under different urls, which 
in turn may confuse the update code in CrawlDbReducer... Do you notice 
any pattern to these pages? What's their origin?


--
Best regards,
Andrzej Bialecki 
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: Dependency Injection

2011-11-22 Thread Andrzej Bialecki

On 22/11/2011 19:47, PJ Herring wrote:

Hey Chris,

Thanks for the response. I looked at the documents you sent me, and I
really do think incorporating some kind of DI Framework could be a great
addition to Nutch.

I have a general plan of attack, but I'll try to write that up more
formally and send it out to get some kind of feedback.


This sounds interesting. As Chris mentioned, the current plugin system 
is far from ideal, but so far it worked reasonably well. The key 
functionality that it implements is:


* self-discovery of services provided by each plugin,
* easy pluggability, by virtue of dropping super-jars (jars with 
impl. classes and nested library jars) into a predefined location,
* controlled classloader isolation between plugins so that incompatible 
versions of libraries can be used
* but also the ability to export specified classes and libraries so that one 
plugin can use another plugin's exported resources on its classpath.

* optional auto-loading of dependent plugins
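
To make the comparison concrete, this is roughly what a plugin descriptor 
expresses today - a sketch only, with placeholder ids, jar and class names:

<plugin id="urlfilter-example" name="Example URL Filter" version="1.0.0"
        provider-name="example.org">
  <runtime>
    <library name="urlfilter-example.jar">
      <export name="*"/>                     <!-- classes visible to other plugins -->
    </library>
    <library name="some-third-party.jar"/>   <!-- nested library jar, isolated per plugin -->
  </runtime>
  <requires>
    <import plugin="nutch-extensionpoints"/> <!-- dependent plugin, loaded automatically -->
  </requires>
  <extension id="org.example.urlfilter" point="org.apache.nutch.net.URLFilter">
    <implementation id="ExampleURLFilter" class="org.example.ExampleURLFilter"/>
  </extension>
</plugin>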

In the past one contributor made a bold attempt to port Nutch to OSGI, 
and it turned out to be much more complicated than we expected, and with 
a bigger impact on the way Nutch applications were supposed to run ... 
so at that time we didn't think this complication was justified.


If we can figure out something between full-blown OSGI and the current 
system then that would be great.


--
Best regards,
Andrzej Bialecki 
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: Dependency Injection

2011-11-23 Thread Andrzej Bialecki

On 23/11/2011 01:02, Andrzej Bialecki wrote:

On 22/11/2011 19:47, PJ Herring wrote:

Hey Chris,

Thanks for the response. I looked at the documents you sent me, and I
really do think incorporating some kind of DI Framework could be a great
addition to Nutch.

I have a general plan of attack, but I'll try to write that up more
formally and send it out to get some kind of feedback.


This sounds interesting. As Chris mentioned, the current plugin system
is far from ideal, but so far it worked reasonably well. The key
functionality that it implements is:

* self-discovery of services provided by each plugin,
* easy pluggability, by virtue of dropping super-jars (jars with
impl. classes and nested library jars) into a predefined location,
* controlled classloader isolation between plugins so that incompatible
versions of libraries can be used
* but also the ability to export specified classes and libraries so that one
plugin can use another plugin's exported resources on its classpath.
* optional auto-loading of dependent plugins

In the past one contributor made a bold attempt to port Nutch to OSGI,
and it turned out to be much more complicated than we expected, and with
a bigger impact on the way Nutch applications were supposed to run ...
so at that time we didn't think this complication was justified.

If we can figure out something between full-blown OSGI and the current
system then that would be great.



You may also want to take a look at JSPF (http://code.google.com/p/jspf) 
which perhaps could be made to satisfy the above requirements without 
too much refactoring.



--
Best regards,
Andrzej Bialecki 
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: Upgrading to Hadoop 0.22.0+

2011-12-13 Thread Andrzej Bialecki

On 13/12/2011 17:42, Lewis John Mcgibbney wrote:

Hi Markus,

I'm certainly in agreement here. If you like to open a Jira, we can
begin the build up a picture of what is required.

Lewis

On Tue, Dec 13, 2011 at 4:41 PM, Markus Jelsma
markus.jel...@openindex.io  wrote:

Hi,

To keep up with the rest of the world I believe we should move from the old
Hadoop mapred API to the new MapReduce API, which has already been done for
the nutchgora branch. Upgrading from hadoop-core to hadoop-common is easily
done in Ivy but all jobs must be tackled and we have many jobs!

Anyone to give pointers and helping hand in this large task?


I guess the question is also whether 0.22 is compatible enough that the 
existing code using the old api more or less compiles. If it 
does, then we can do the transition gradually; if it doesn't, then it's a 
bigger issue.


This is easy to verify - just drop in the 0.22 jars and see if it 
compiles / tests are passing.


--
Best regards,
Andrzej Bialecki 
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: Upgrading to Hadoop 0.22.0+

2011-12-13 Thread Andrzej Bialecki

On 13/12/2011 18:04, Markus Jelsma wrote:

Hi

I did a quick test to see what happens and it won't compile. It cannot find
our old mapred API's in 0.22. I've also tried 0.20.205.0 which compiles but
won't run and many tests fail with stuff like:

Exception in thread main java.lang.NoClassDefFoundError:
org/codehaus/jackson/map/JsonMappingException
 at
org.apache.nutch.util.dupedb.HostDeduplicator.deduplicator(HostDeduplicator.java:421)


Hmm... what's that? I don't see this class (or this package) in the 
Nutch tree. Also, trunk doesn't use JSON for anything as far as I know.



 at
org.apache.nutch.util.dupedb.HostDeduplicator.run(HostDeduplicator.java:443)
 at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
 at
org.apache.nutch.util.dupedb.HostDeduplicator.main(HostDeduplicator.java:431)
Caused by: java.lang.ClassNotFoundException:
org.codehaus.jackson.map.JsonMappingException
 at java.net.URLClassLoader$1.run(URLClassLoader.java:202)
 at java.security.AccessController.doPrivileged(Native Method)
 at java.net.URLClassLoader.findClass(URLClassLoader.java:190)
 at java.lang.ClassLoader.loadClass(ClassLoader.java:306)
 at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:301)
 at java.lang.ClassLoader.loadClass(ClassLoader.java:247)
 ... 4 more

I think this can be overcome but we cannot hide from the fact that all jobs
must be ported to the new API at some point.

You did some work on the new API's, did you come across any cumbersome issues
when working on it?


It was quite some time ago .. but I don't remember anything being really 
complicated, it was just tedious - and once you've done one class the 
other classes follow roughly the same pattern.



--
Best regards,
Andrzej Bialecki 
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: [jira] [Created] (NUTCH-1225) Migrate CrawlDBScanner to MapReduce API

2011-12-14 Thread Andrzej Bialecki

On 14/12/2011 16:01, Markus Jelsma wrote:

This is highly annoying, MapFileOutputFormat is not present in the MapReduce
API until 0.21!


AFAIK that's not the case ... there is both an old api and a new api 
implementation (the old one is deprecated). The new api is in 
org.apache.hadoop.mapreduce.lib.output .


--
Best regards,
Andrzej Bialecki 
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: [jira] [Created] (NUTCH-1225) Migrate CrawlDBScanner to MapReduce API

2011-12-14 Thread Andrzej Bialecki

On 14/12/2011 18:30, Markus Jelsma wrote:

proper link:

http://hadoop.apache.org/common/docs/r0.20.205.0/api/org/apache/hadoop/mapreduce/lib/output/package-summary.html


I thought the goal was to upgrade to 0.22, where this class is present. 
In 0.20.205 org.apache.hadoop.mapred.MapFileOutputFormat still uses the 
old api, and it's not deprecated yet.



--
Best regards,
Andrzej Bialecki 
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: [jira] [Created] (NUTCH-1225) Migrate CrawlDBScanner to MapReduce API

2011-12-15 Thread Andrzej Bialecki

On 15/12/2011 13:13, Markus Jelsma wrote:

hmm, I don't see how I can use the old mapred MapFileOutputFormat API with the new
Job API. job.setOutputFormatClass(MapFileOutputFormat.class) expects the
mapreduce.lib.output.MapFileOutputFormat class and won't accept the old API.

setOutputFormatClass(java.lang.Class<? extends
org.apache.hadoop.mapreduce.OutputFormat>) in org.apache.hadoop.mapreduce.Job
cannot be applied to
(java.lang.Class<org.apache.hadoop.mapred.MapFileOutputFormat>)

In short, i don't know how i can migrate jobs to the new API on 0.20.x without
having MapFileOutputFormat present in the new API. Trying to set to old
mapoutputformat


Ah, no, that's not what I meant ... of course you need to change the 
code to use the new api, and the new code will look quite different :) 
my point was only that it is different in a consistent way, so after 
you've ported one or two classes the other ones are easy to convert, too...


I'm bogged with other work now, but I'll see if I can prepare an example 
later today...


--
Best regards,
Andrzej Bialecki 
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: Build failed in Jenkins: Nutch-trunk #1706

2011-12-28 Thread Andrzej Bialecki

On 28/12/2011 12:00, Lewis John Mcgibbney wrote:

Hi Guys,

Pretty strange compilation failure, this test class hasn't been hacked
in months, and from the surface, having looked at the test case there
appears to be no obvious reasons for it failing to compile. I've kick
started another build on Jenkins to see if it will resolve itself.


I don't think it will - I can reproduce this failure locally. Here's 
what fixed the failure for me (I'm pretty ignorant about ivy/maven so 
there's likely a more correct fix for this):


Index: ivy/ivy.xml
===================================================================
--- ivy/ivy.xml (revision 1225046)
+++ ivy/ivy.xml (working copy)
@@ -69,7 +69,7 @@
   <!-- Configuration: test -->

   <!-- artifacts needed for testing -->
-  <dependency org="junit" name="junit" rev="3.8.1" conf="test->default" />
+  <dependency org="junit" name="junit" rev="3.8.1" conf="*->default" />
   <dependency org="org.apache.hadoop" name="hadoop-test" rev="0.20.205.0" conf="test->default" />


--
Best regards,
Andrzej Bialecki 
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: Build failed in Jenkins: Nutch-trunk #1706

2011-12-28 Thread Andrzej Bialecki

On 28/12/2011 14:15, Lewis John Mcgibbney wrote:

Hi Andrzej,

Can anyone confirm? I've tried this patch locally and although I
couldn't reproduce the original issue, it seems to be working fine for
me as well.


Check your lib/ dir, maybe you have a local copy of junit jar that gets 
pulled onto the classpath and masks the issue? This happened to me once or 
twice...



--
Best regards,
Andrzej Bialecki 
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: Drawing an analogy between AdaptiveFetchSchedule and AdaptiveCrawlDelay

2012-03-02 Thread Andrzej Bialecki

On 02/03/2012 12:45, Lewis John Mcgibbney wrote:

Hi Guys,

As there were some comments on the user list, I recently got digging
into http redirects and then stumbled across NUTCH-1042. Although these are
individual issues e.g. redirects and crawl delays, I think they are
certainly linked, however what is interesting is that users 'usually'
don't consider them to be interlinked as such and therefore struggle to
debug how and why either the redirect or the crawl delay pages are not
being fetched.

Doing some more digging I found the now rather old and tatty NUTCH-475,
which obviously got me thinking about how we maintain the
AdaptiveFetchSchedule for custom refetching. Now I've started
thinking about the following:

- Regardless of whether we implement an AdaptiveCrawlDelay, NUTCH-1042
still needs to be fixed, as this is obviously becoming a bit of a pain for some
users.


Yes.


- Can someone shine some light on what happened to Fetcher2.java that
Dogacan refers to? I was only ever accustomed to OldFetcher and Fetcher :0)


Fetcher2 is the current Fetcher. The original Fetcher was temporarily 
renamed OldFetcher and then removed.


--
Best regards,
Andrzej Bialecki 
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: question about ObjectCache

2012-04-10 Thread Andrzej Bialecki

On 10/04/2012 05:00, Xiaolong Yang wrote:

Hi all,

I'm reading the Nutch source code and I'm puzzled by ObjectCache.java in the
org.apache.nutch.util package. It seems to be of little benefit in the URL
normalizers and URL filters. I have also read some discussion about caching
in NUTCH-169 and NUTCH-501, but I still don't understand it.

Can anyone tell me where ObjectCache is used and where it provides a real
benefit in Nutch?


ObjectCache is designed to cache ready-to-use instances of Nutch 
plugins. The process of finding, instantiating and initializing plugins 
is inefficient, because it involves parsing plugin descriptors, 
initializing plugins, collecting the ones that implement correct 
extension points, etc.


It would kill performance if this process were invoked each time you 
want to run all plugins of a given type (e.g. URLNormalizer-s). The 
facade URLNormalizers/URLFilters and others make sure that plugin 
instances of a given type are initialized once per lifetime of a JVM, 
and then they are cached in ObjectCache, so that next time you want to 
use them they can be retrieved from a cache, instead of going again 
through the process of parsing/instantiating/initializing.
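
In pseudo-code (simplified - the real facade code differs, and loadAndInit() here 
just stands for the expensive descriptor-parsing path) the pattern is:

ObjectCache objectCache = ObjectCache.get(conf);
URLNormalizer[] normalizers =
    (URLNormalizer[]) objectCache.getObject("urlnormalizers-default");
if (normalizers == null) {
  // first use in this JVM for this configuration: parse the plugin
  // descriptors, find the URLNormalizer extensions, instantiate and
  // configure them - the expensive part
  normalizers = loadAndInit(conf);   // hypothetical helper
  objectCache.setObject("urlnormalizers-default", normalizers);
}
// every later call gets the cached, ready-to-use instances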


--
Best regards,
Andrzej Bialecki 
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



[jira] Commented: (NUTCH-650) Hbase Integration

2010-06-29 Thread Andrzej Bialecki (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-650?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12883559#action_12883559
 ] 

Andrzej Bialecki  commented on NUTCH-650:
-

So far as one can digest such a giant patch ;) I think this is ok, at least 
from the legal POV it clarifies the situation and it doesn't bring any 
dependencies with incompatible licenses. As for the content itself, we'll need 
to resolve this incrementally, as discussed on the list.

So, a cautious +1 from me to apply this on branches/nutchbase.

 Hbase Integration
 -

 Key: NUTCH-650
 URL: https://issues.apache.org/jira/browse/NUTCH-650
 Project: Nutch
  Issue Type: New Feature
Reporter: Doğacan Güney
Assignee: Doğacan Güney
 Fix For: 2.0

 Attachments: hbase-integration_v1.patch, hbase_v2.patch, 
 latest-nutchbase-vs-original-branch-point.patch, 
 latest-nutchbase-vs-svn-nutchbase.patch, malformedurl.patch, meta.patch, 
 meta2.patch, nofollow-hbase.patch, NUTCH-650.patch, nutch-habase.patch, 
 searching.diff, slash.patch


 This issue will track nutch/hbase integration

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Assigned: (NUTCH-837) Remove search servers and Lucene dependencies

2010-07-01 Thread Andrzej Bialecki (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-837?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrzej Bialecki  reassigned NUTCH-837:
---

Assignee: Andrzej Bialecki 

 Remove search servers and Lucene dependencies 
 --

 Key: NUTCH-837
 URL: https://issues.apache.org/jira/browse/NUTCH-837
 Project: Nutch
  Issue Type: Task
  Components: searcher, web gui
Affects Versions: 1.1
Reporter: Julien Nioche
Assignee: Andrzej Bialecki 
 Fix For: 2.0


 One of the main aspects of 2.0 is the delegation of the indexing and search 
 to external resources like SOLR. We can simplify the code a lot by getting 
 rid of the : 
 * search servers
 * indexing and analysis with Lucene
 * search side functionalities : ontologies / clustering etc...
 In the short term only SOLR / SOLRCloud will be supported but the plan would 
 be to add other systems as well. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (NUTCH-837) Remove search servers and Lucene dependencies

2010-07-02 Thread Andrzej Bialecki (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-837?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrzej Bialecki  updated NUTCH-837:


Attachment: NUTCH-837.patch

Updated patch against r959954 (after NUTCH-836).

 Remove search servers and Lucene dependencies 
 --

 Key: NUTCH-837
 URL: https://issues.apache.org/jira/browse/NUTCH-837
 Project: Nutch
  Issue Type: Task
  Components: searcher, web gui
Affects Versions: 1.1
Reporter: Julien Nioche
Assignee: Andrzej Bialecki 
 Fix For: 2.0

 Attachments: NUTCH-837.patch, NUTCH-837.patch


 One of the main aspects of 2.0 is the delegation of the indexing and search 
 to external resources like SOLR. We can simplify the code a lot by getting 
 rid of the : 
 * search servers
 * indexing and analysis with Lucene
 * search side functionalities : ontologies / clustering etc...
 In the short term only SOLR / SOLRCloud will be supported but the plan would 
 be to add other systems as well. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (NUTCH-837) Remove search servers and Lucene dependencies

2010-07-02 Thread Andrzej Bialecki (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-837?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrzej Bialecki  updated NUTCH-837:


Attachment: (was: NUTCH-837.patch)

 Remove search servers and Lucene dependencies 
 --

 Key: NUTCH-837
 URL: https://issues.apache.org/jira/browse/NUTCH-837
 Project: Nutch
  Issue Type: Task
  Components: searcher, web gui
Affects Versions: 1.1
Reporter: Julien Nioche
Assignee: Andrzej Bialecki 
 Fix For: 2.0

 Attachments: NUTCH-837.patch


 One of the main aspects of 2.0 is the delegation of the indexing and search 
 to external resources like SOLR. We can simplify the code a lot by getting 
 rid of the : 
 * search servers
 * indexing and analysis with Lucene
 * search side functionalities : ontologies / clustering etc...
 In the short term only SOLR / SOLRCloud will be supported but the plan would 
 be to add other systems as well. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (NUTCH-837) Remove search servers and Lucene dependencies

2010-07-02 Thread Andrzej Bialecki (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-837?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12884729#action_12884729
 ] 

Andrzej Bialecki  commented on NUTCH-837:
-

bq. So, I think we should still have a Nutch webapp and in my mind it's a 
must-have for a 2.0 release...

I agree. But for the moment it's better to delete the old webapp stuff that we 
know for sure doesn't work with the current Nutch, and it will be completely 
reimplemented anyway.

 Remove search servers and Lucene dependencies 
 --

 Key: NUTCH-837
 URL: https://issues.apache.org/jira/browse/NUTCH-837
 Project: Nutch
  Issue Type: Task
  Components: searcher, web gui
Affects Versions: 1.1
Reporter: Julien Nioche
Assignee: Andrzej Bialecki 
 Fix For: 2.0

 Attachments: NUTCH-837.patch


 One of the main aspects of 2.0 is the delegation of the indexing and search 
 to external resources like SOLR. We can simplify the code a lot by getting 
 rid of the : 
 * search servers
 * indexing and analysis with Lucene
 * search side functionalities : ontologies / clustering etc...
 In the short term only SOLR / SOLRCloud will be supported but the plan would 
 be to add other systems as well. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Resolved: (NUTCH-837) Remove search servers and Lucene dependencies

2010-07-02 Thread Andrzej Bialecki (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-837?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrzej Bialecki  resolved NUTCH-837.
-

Resolution: Fixed

Committed in r960064. Thanks for review!

 Remove search servers and Lucene dependencies 
 --

 Key: NUTCH-837
 URL: https://issues.apache.org/jira/browse/NUTCH-837
 Project: Nutch
  Issue Type: Task
  Components: searcher, web gui
Affects Versions: 1.1
Reporter: Julien Nioche
Assignee: Andrzej Bialecki 
 Fix For: 2.0

 Attachments: NUTCH-837.patch


 One of the main aspects of 2.0 is the delegation of the indexing and search 
 to external resources like SOLR. We can simplify the code a lot by getting 
 rid of the : 
 * search servers
 * indexing and analysis with Lucene
 * search side functionalities : ontologies / clustering etc...
 In the short term only SOLR / SOLRCloud will be supported but the plan would 
 be to add other systems as well. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (NUTCH-821) Use ivy in nutch builds

2010-07-05 Thread Andrzej Bialecki (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-821?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12885188#action_12885188
 ] 

Andrzej Bialecki  commented on NUTCH-821:
-

I think this patch refers to some parts that were already removed in NUTCH-837 
...

Also, it would be nice to have a target that sets up an Eclipse project - after 
this patch is applied the lib/ dir is nearly empty and you need to run the build at 
least once to pull in the dependencies - this may be confusing.

 Use ivy in nutch builds
 ---

 Key: NUTCH-821
 URL: https://issues.apache.org/jira/browse/NUTCH-821
 Project: Nutch
  Issue Type: New Feature
  Components: build
Affects Versions: 2.0
Reporter: Enis Soztutar
Assignee: Enis Soztutar
 Fix For: 2.0

 Attachments: NUTCH-821.patch, nutchbase-ivy_v1.patch


 Ivy is the de-facto dependency management tool used in conjunction with Ant. 
 It would be nice if we switch to using Ivy in Nutch builds. 
 Maven is also an alternative, but I think Nutch will benefit more with an 
 Ant+Ivy architecture. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (NUTCH-696) Timeout for Parser

2010-07-05 Thread Andrzej Bialecki (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-696?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrzej Bialecki  updated NUTCH-696:


Attachment: timeout.patch

A simple patch that implements the strategy outlined here http://bit.ly/bdTYrS 
- I've been recently suffering from this issue, so this is better than nothing. 
Julien's strategy would work, too, but then the job takes much longer to 
execute.
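
The idea in a nutshell (just a sketch using java.util.concurrent - the attached 
patch differs in details and the property name is only illustrative): run the 
parse call in a separate thread and give up if it doesn't return in time:

{code}
ExecutorService executor = Executors.newSingleThreadExecutor();
Future<ParseResult> task = executor.submit(new Callable<ParseResult>() {
  public ParseResult call() throws Exception {
    return parser.getParse(content);   // the call that may hang
  }
});
try {
  ParseResult res = task.get(conf.getInt("parser.timeout", 30), TimeUnit.SECONDS);
  // use res ...
} catch (TimeoutException e) {
  task.cancel(true);                   // record a failed parse and move on
} catch (Exception e) {
  // other parse failures are handled as before
}
{code}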

 Timeout for Parser
 --

 Key: NUTCH-696
 URL: https://issues.apache.org/jira/browse/NUTCH-696
 Project: Nutch
  Issue Type: Wish
  Components: fetcher
Reporter: Julien Nioche
Priority: Minor
 Attachments: timeout.patch


 I found that the parsing sometimes crashes due to a problem on a specific 
 document, which is a bit of a shame as this blocks the rest of the segment 
 and Hadoop ends up finding that the node does not respond. I was wondering 
 about whether it would make sense to have a timeout mechanism for the parsing 
 so that if a document is not parsed after a time t, it is simply treated as 
 an exception and we can get on with the rest of the process.
 Does that make sense? Where do you think we should implement that, in 
 ParseUtil?

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (NUTCH-696) Timeout for Parser

2010-07-05 Thread Andrzej Bialecki (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-696?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12885257#action_12885257
 ] 

Andrzej Bialecki  commented on NUTCH-696:
-

Yes - this patch is a quick solution that allowed me to complete a crawl. If 
people feel this is useful, let's polish it.

 Timeout for Parser
 --

 Key: NUTCH-696
 URL: https://issues.apache.org/jira/browse/NUTCH-696
 Project: Nutch
  Issue Type: Wish
  Components: fetcher
Reporter: Julien Nioche
Priority: Minor
 Attachments: timeout.patch


 I found that the parsing sometimes crashes due to a problem on a specific 
 document, which is a bit of a shame as this blocks the rest of the segment 
 and Hadoop ends up finding that the node does not respond. I was wondering 
 about whether it would make sense to have a timeout mechanism for the parsing 
 so that if a document is not parsed after a time t, it is simply treated as 
 an exception and we can get on with the rest of the process.
 Does that make sense? Where do you think we should implement that, in 
 ParseUtil?

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Reopened: (NUTCH-696) Timeout for Parser

2010-07-05 Thread Andrzej Bialecki (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-696?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrzej Bialecki  reopened NUTCH-696:
-


This may be useful after all - let's gather more comments.

 Timeout for Parser
 --

 Key: NUTCH-696
 URL: https://issues.apache.org/jira/browse/NUTCH-696
 Project: Nutch
  Issue Type: Wish
  Components: fetcher
Reporter: Julien Nioche
Priority: Minor
 Attachments: timeout.patch


 I found that the parsing sometimes crashes due to a problem on a specific 
 document, which is a bit of a shame as this blocks the rest of the segment 
 and Hadoop ends up finding that the node does not respond. I was wondering 
 about whether it would make sense to have a timeout mechanism for the parsing 
 so that if a document is not parsed after a time t, it is simply treated as 
 an exception and we can get on with the rest of the process.
 Does that make sense? Where do you think we should implement that, in 
 ParseUtil?

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (NUTCH-696) Timeout for Parser

2010-07-05 Thread Andrzej Bialecki (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-696?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12885295#action_12885295
 ] 

Andrzej Bialecki  commented on NUTCH-696:
-

I agree, ultimately that's the way to go. However, I needed something _now_, 
and the patch helps to solve the problem that I have now - and until this 
problem is solved in Tika this patch provides some kind of band-aid for us poor 
Nutch-ers...

 Timeout for Parser
 --

 Key: NUTCH-696
 URL: https://issues.apache.org/jira/browse/NUTCH-696
 Project: Nutch
  Issue Type: Wish
  Components: fetcher
Reporter: Julien Nioche
Priority: Minor
 Attachments: timeout.patch


 I found that the parsing sometimes crashes due to a problem on a specific 
 document, which is a bit of a shame as this blocks the rest of the segment 
 and Hadoop ends up finding that the node does not respond. I was wondering 
 about whether it would make sense to have a timeout mechanism for the parsing 
 so that if a document is not parsed after a time t, it is simply treated as 
 an exception and we can get on with the rest of the process.
 Does that make sense? Where do you think we should implement that, in 
 ParseUtil?

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (NUTCH-821) Use ivy in nutch builds

2010-07-06 Thread Andrzej Bialecki (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-821?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12885583#action_12885583
 ] 

Andrzej Bialecki  commented on NUTCH-821:
-

+1 for this patch for now - all good comments, there's plenty of improvements 
we can make, so let's line them up as separate issues.

 Use ivy in nutch builds
 ---

 Key: NUTCH-821
 URL: https://issues.apache.org/jira/browse/NUTCH-821
 Project: Nutch
  Issue Type: New Feature
  Components: build
Affects Versions: 2.0
Reporter: Enis Soztutar
Assignee: Enis Soztutar
 Fix For: 2.0

 Attachments: NUTCH-821.patch, nutchbase-ivy_v1.patch


 Ivy is the de-facto dependency management tool used in conjunction with Ant. 
 It would be nice if we switch to using Ivy in Nutch builds. 
 Maven is also an alternative, but I think Nutch will benefit more with an 
 Ant+Ivy architecture. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Created: (NUTCH-843) Separate the build and runtime environments

2010-07-07 Thread Andrzej Bialecki (JIRA)
Separate the build and runtime environments
---

 Key: NUTCH-843
 URL: https://issues.apache.org/jira/browse/NUTCH-843
 Project: Nutch
  Issue Type: Improvement
  Components: build
Affects Versions: 2.0
Reporter: Andrzej Bialecki 
Assignee: Andrzej Bialecki 


Currently there is no clean separation of source, build and runtime artifacts. 
On one hand, it makes it easier to get started in local mode, but on the other 
hand it makes the distributed (or pseudo-distributed) setup much more 
challenging and tricky. Also, some resources (config files and classes) are 
included several times on the classpath, they are loaded under different 
classloaders, and in the end it's not obvious which copy takes precedence, and why.

Here's an example of a harmful unintended behavior caused by this mess: Hadoop 
daemons (jobtracker and tasktracker) will get conf/ and build/ on their 
classpath. This means that a task running on this cluster will have two copies 
of resources from these locations - one from the inherited classpath from 
tasktracker, and the other one from the just unpacked nutch.job file. If these 
two versions differ, only the first one will be loaded, which in this case is 
the one taken from the (unpacked) conf/ and build/ - the other one, from within 
the nutch.job file, will be ignored.

It's even worse when you add more nodes to the cluster - the nutch.job will be 
shipped to the new nodes as a part of each task setup, but now the remote 
tasktracker child processes will use resources from nutch.job - so some tasks 
will use different versions of resources than other tasks. This usually leads 
to a host of very difficult to debug issues.

This issue proposes then to separate these environments into the following 
areas:

* source area - i.e. our current sources. Note that bin/ scripts will belong to 
this category too, so there will be no top-level bin/. nutch-default.xml 
belongs to this category too. Other customizable files can be moved to src/conf 
too, or they could stay in top-level conf/ as today, with a README that 
explains that changes made there take effect only after you rebuild the job jar.

* build area - contains build artifacts, among them the nutch.job jar.

* runtime (or deploy) area - this area contains all artifacts needed to run 
Nutch jobs. For a distributed setup that uses an existing Hadoop cluster 
(installed from plain vanilla Hadoop release) this will be a {{/deploy}} 
directory, where we put the following:
{code}
bin/nutch
nutch.job
{code}
That's it - nothing else should be needed, because all other resources are 
already included in the job jar. These resources can be copied directly to the 
master Hadoop node.

For a local setup (using LocalJobTracker) this will be a {{/runtime}} 
directory, where we put the following:
{code}
bin/nutch
lib/hadoop-libs
plugins/
nutch.job
{code}
Due to limitations in the PluginClassLoader the local runtime requires that the 
plugins/ directory be unpacked from the job jar. And we need the hadoop libs to 
run in the local mode. We may later on refine this local setup to something 
like this:
{code}
bin/nutch
conf/
lib/hadoop-libs
lib/nutch-libs
plugins/
nutch.jar
{code}
so that it's easier to modify the config without rebuilding the job jar (which 
actually would not be used in this case).

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (NUTCH-843) Separate the build and runtime environments

2010-07-07 Thread Andrzej Bialecki (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-843?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrzej Bialecki  updated NUTCH-843:


Attachment: NUTCH-843.patch

This patch moves bin/nutch to src/bin/nutch, and creates /runtime/deploy and 
/runtime/local areas, populated with the right pieces. bin/nutch has been 
modified to work correctly in both cases.

 Separate the build and runtime environments
 ---

 Key: NUTCH-843
 URL: https://issues.apache.org/jira/browse/NUTCH-843
 Project: Nutch
  Issue Type: Improvement
  Components: build
Affects Versions: 2.0
Reporter: Andrzej Bialecki 
Assignee: Andrzej Bialecki 
 Attachments: NUTCH-843.patch


 Currently there is no clean separation of source, build and runtime 
 artifacts. On one hand, it makes it easier to get started in local mode, but 
 on the other hand it makes the distributed (or pseudo-distributed) setup much 
 more challenging and tricky. Also, some resources (config files and classes) 
 are included several times on the classpath, they are loaded under different 
 classloaders, and in the end it's not obvious what copy and why takes 
 precedence.
 Here's an example of a harmful unintended behavior caused by this mess: 
 Hadoop daemons (jobtracker and tasktracker) will get conf/ and build/ on 
 their classpath. This means that a task running on this cluster will have two 
 copies of resources from these locations - one from the inherited classpath 
 from tasktracker, and the other one from the just unpacked nutch.job file. If 
 these two versions differ, only the first one will be loaded, which in this 
 case is the one taken from the (unpacked) conf/ and build/ - the other one, 
 from within the nutch.job file, will be ignored.
 It's even worse when you add more nodes to the cluster - the nutch.job will 
 be shipped to the new nodes as a part of each task setup, but now the remote 
 tasktracker child processes will use resources from nutch.job - so some tasks 
 will use different versions of resources than other tasks. This usually leads 
 to a host of very difficult to debug issues.
 This issue proposes then to separate these environments into the following 
 areas:
 * source area - i.e. our current sources. Note that bin/ scripts will belong 
 to this category too, so there will be no top-level bin/. nutch-default.xml 
 belongs to this category too. Other customizable files can be moved to 
 src/conf too, or they could stay in top-level conf/ as today, with a README 
 that explains that changes made there take effect only after you rebuild the 
 job jar.
 * build area - contains build artifacts, among them the nutch.job jar.
 * runtime (or deploy) area - this area contains all artifacts needed to run 
 Nutch jobs. For a distributed setup that uses an existing Hadoop cluster 
 (installed from plain vanilla Hadoop release) this will be a {{/deploy}} 
 directory, where we put the following:
 {code}
 bin/nutch
 nutch.job
 {code}
 That's it - nothing else should be needed, because all other resources are 
 already included in the job jar. These resources can be copied directly to 
 the master Hadoop node.
 For a local setup (using LocalJobTracker) this will be a {{/runtime}} 
 directory, where we put the following:
 {code}
 bin/nutch
 lib/hadoop-libs
 plugins/
 nutch.job
 {code}
 Due to limitations in the PluginClassLoader the local runtime requires that 
 the plugins/ directory be unpacked from the job jar. And we need the hadoop 
 libs to run in the local mode. We may later on refine this local setup to 
 something like this:
 {code}
 bin/nutch
 conf/
 lib/hadoop-libs
 lib/nutch-libs
 plugins/
 nutch.jar
 {code}
 so that it's easier to modify the config without rebuilding the job jar 
 (which actually would not be used in this case).

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (NUTCH-843) Separate the build and runtime environments

2010-07-07 Thread Andrzej Bialecki (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-843?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12886015#action_12886015
 ] 

Andrzej Bialecki  commented on NUTCH-843:
-

We need to create the job file anyway. Actually, the patch I attached does 
something like this for the local setup (lib/ is flattened), but still I would 
argue for setting up two areas, /runtime/deploy and /runtime/local - it's 
painfully obvious then what parts you need to deploy to a Hadoop cluster.

 Separate the build and runtime environments
 ---

 Key: NUTCH-843
 URL: https://issues.apache.org/jira/browse/NUTCH-843
 Project: Nutch
  Issue Type: Improvement
  Components: build
Affects Versions: 2.0
Reporter: Andrzej Bialecki 
Assignee: Andrzej Bialecki 
 Attachments: NUTCH-843.patch


 Currently there is no clean separation of source, build and runtime 
 artifacts. On one hand, it makes it easier to get started in local mode, but 
 on the other hand it makes the distributed (or pseudo-distributed) setup much 
 more challenging and tricky. Also, some resources (config files and classes) 
 are included several times on the classpath, they are loaded under different 
 classloaders, and in the end it's not obvious what copy and why takes 
 precedence.
 Here's an example of a harmful unintended behavior caused by this mess: 
 Hadoop daemons (jobtracker and tasktracker) will get conf/ and build/ on 
 their classpath. This means that a task running on this cluster will have two 
 copies of resources from these locations - one from the inherited classpath 
 from tasktracker, and the other one from the just unpacked nutch.job file. If 
 these two versions differ, only the first one will be loaded, which in this 
 case is the one taken from the (unpacked) conf/ and build/ - the other one, 
 from within the nutch.job file, will be ignored.
 It's even worse when you add more nodes to the cluster - the nutch.job will 
 be shipped to the new nodes as a part of each task setup, but now the remote 
 tasktracker child processes will use resources from nutch.job - so some tasks 
 will use different versions of resources than other tasks. This usually leads 
 to a host of very difficult to debug issues.
 This issue proposes then to separate these environments into the following 
 areas:
 * source area - i.e. our current sources. Note that bin/ scripts will belong 
 to this category too, so there will be no top-level bin/. nutch-default.xml 
 belongs to this category too. Other customizable files can be moved to 
 src/conf too, or they could stay in top-level conf/ as today, with a README 
 that explains that changes made there take effect only after you rebuild the 
 job jar.
 * build area - contains build artifacts, among them the nutch.job jar.
 * runtime (or deploy) area - this area contains all artifacts needed to run 
 Nutch jobs. For a distributed setup that uses an existing Hadoop cluster 
 (installed from plain vanilla Hadoop release) this will be a {{/deploy}} 
 directory, where we put the following:
 {code}
 bin/nutch
 nutch.job
 {code}
 That's it - nothing else should be needed, because all other resources are 
 already included in the job jar. These resources can be copied directly to 
 the master Hadoop node.
 For a local setup (using LocalJobTracker) this will be a {{/runtime}} 
 directory, where we put the following:
 {code}
 bin/nutch
 lib/hadoop-libs
 plugins/
 nutch.job
 {code}
 Due to limitations in the PluginClassLoader the local runtime requires that 
 the plugins/ directory be unpacked from the job jar. And we need the hadoop 
 libs to run in the local mode. We may later on refine this local setup to 
 something like this:
 {code}
 bin/nutch
 conf/
 lib/hadoop-libs
 lib/nutch-libs
 plugins/
 nutch.jar
 {code}
 so that it's easier to modify the config without rebuilding the job jar 
 (which actually would not be used in this case).

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (NUTCH-843) Separate the build and runtime environments

2010-07-07 Thread Andrzej Bialecki (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-843?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrzej Bialecki  updated NUTCH-843:


Attachment: NUTCH-843.patch

Updated patch that moves nutch.jar to lib/ for the local runtime.

 Separate the build and runtime environments
 ---

 Key: NUTCH-843
 URL: https://issues.apache.org/jira/browse/NUTCH-843
 Project: Nutch
  Issue Type: Improvement
  Components: build
Affects Versions: 2.0
Reporter: Andrzej Bialecki 
Assignee: Andrzej Bialecki 
 Attachments: NUTCH-843.patch, NUTCH-843.patch


 Currently there is no clean separation of source, build and runtime 
 artifacts. On one hand, it makes it easier to get started in local mode, but 
 on the other hand it makes the distributed (or pseudo-distributed) setup much 
 more challenging and tricky. Also, some resources (config files and classes) 
 are included several times on the classpath, they are loaded under different 
 classloaders, and in the end it's not obvious what copy and why takes 
 precedence.
 Here's an example of a harmful unintended behavior caused by this mess: 
 Hadoop daemons (jobtracker and tasktracker) will get conf/ and build/ on 
 their classpath. This means that a task running on this cluster will have two 
 copies of resources from these locations - one from the inherited classpath 
 from tasktracker, and the other one from the just unpacked nutch.job file. If 
 these two versions differ, only the first one will be loaded, which in this 
 case is the one taken from the (unpacked) conf/ and build/ - the other one, 
 from within the nutch.job file, will be ignored.
 It's even worse when you add more nodes to the cluster - the nutch.job will 
 be shipped to the new nodes as a part of each task setup, but now the remote 
 tasktracker child processes will use resources from nutch.job - so some tasks 
 will use different versions of resources than other tasks. This usually leads 
 to a host of very difficult to debug issues.
 This issue proposes then to separate these environments into the following 
 areas:
 * source area - i.e. our current sources. Note that bin/ scripts will belong 
 to this category too, so there will be no top-level bin/. nutch-default.xml 
 belongs to this category too. Other customizable files can be moved to 
 src/conf too, or they could stay in top-level conf/ as today, with a README 
 that explains that changes made there take effect only after you rebuild the 
 job jar.
 * build area - contains build artifacts, among them the nutch.job jar.
 * runtime (or deploy) area - this area contains all artifacts needed to run 
 Nutch jobs. For a distributed setup that uses an existing Hadoop cluster 
 (installed from plain vanilla Hadoop release) this will be a {{/deploy}} 
 directory, where we put the following:
 {code}
 bin/nutch
 nutch.job
 {code}
 That's it - nothing else should be needed, because all other resources are 
 already included in the job jar. These resources can be copied directly to 
 the master Hadoop node.
 For a local setup (using LocalJobTracker) this will be a {{/runtime}} 
 directory, where we put the following:
 {code}
 bin/nutch
 lib/hadoop-libs
 plugins/
 nutch.job
 {code}
 Due to limitations in the PluginClassLoader the local runtime requires that 
 the plugins/ directory be unpacked from the job jar. And we need the hadoop 
 libs to run in the local mode. We may later on refine this local setup to 
 something like this:
 {code}
 bin/nutch
 conf/
 lib/hadoop-libs
 lib/nutch-libs
 plugins/
 nutch.jar
 {code}
 so that it's easier to modify the config without rebuilding the job jar 
 (which actually would not be used in this case).

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (NUTCH-844) Improve NutchConfiguration

2010-07-08 Thread Andrzej Bialecki (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-844?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrzej Bialecki  updated NUTCH-844:


Attachment: conf.patch

 Improve NutchConfiguration
 --

 Key: NUTCH-844
 URL: https://issues.apache.org/jira/browse/NUTCH-844
 Project: Nutch
  Issue Type: Improvement
Affects Versions: 2.0
Reporter: Andrzej Bialecki 
Assignee: Andrzej Bialecki 
 Fix For: 2.0

 Attachments: conf.patch


 This patch cleans up NutchConfiguration from servlet dependency, and modifies 
 the API to allow bootstrapping via API from Properties. This is important for 
 use cases where Nutch is embedded in a larger application.
 Also, while I'm at it, remove the support for alternative crawl 
 configuration when running Crawl tool, which has always been a source of 
 confusion.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (NUTCH-843) Separate the build and runtime environments

2010-07-08 Thread Andrzej Bialecki (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-843?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12886318#action_12886318
 ] 

Andrzej Bialecki  commented on NUTCH-843:
-

runtime/local doesn't need Hadoop scripts - by definition it uses the local FS and 
the local job tracker, so Hadoop scripts are of no use. Native libs - see 
NUTCH-845.

 Separate the build and runtime environments
 ---

 Key: NUTCH-843
 URL: https://issues.apache.org/jira/browse/NUTCH-843
 Project: Nutch
  Issue Type: Improvement
  Components: build
Affects Versions: 2.0
Reporter: Andrzej Bialecki 
Assignee: Andrzej Bialecki 
 Fix For: 2.0

 Attachments: NUTCH-843.patch, NUTCH-843.patch


 Currently there is no clean separation of source, build and runtime 
 artifacts. On one hand, it makes it easier to get started in local mode, but 
 on the other hand it makes the distributed (or pseudo-distributed) setup much 
 more challenging and tricky. Also, some resources (config files and classes) 
 are included several times on the classpath, they are loaded under different 
 classloaders, and in the end it's not obvious what copy and why takes 
 precedence.
 Here's an example of a harmful unintended behavior caused by this mess: 
 Hadoop daemons (jobtracker and tasktracker) will get conf/ and build/ on 
 their classpath. This means that a task running on this cluster will have two 
 copies of resources from these locations - one from the inherited classpath 
 from tasktracker, and the other one from the just unpacked nutch.job file. If 
 these two versions differ, only the first one will be loaded, which in this 
 case is the one taken from the (unpacked) conf/ and build/ - the other one, 
 from within the nutch.job file, will be ignored.
 It's even worse when you add more nodes to the cluster - the nutch.job will 
 be shipped to the new nodes as a part of each task setup, but now the remote 
 tasktracker child processes will use resources from nutch.job - so some tasks 
 will use different versions of resources than other tasks. This usually leads 
 to a host of very difficult to debug issues.
 This issue proposes then to separate these environments into the following 
 areas:
 * source area - i.e. our current sources. Note that bin/ scripts will belong 
 to this category too, so there will be no top-level bin/. nutch-default.xml 
 belongs to this category too. Other customizable files can be moved to 
 src/conf too, or they could stay in top-level conf/ as today, with a README 
 that explains that changes made there take effect only after you rebuild the 
 job jar.
 * build area - contains build artifacts, among them the nutch.job jar.
 * runtime (or deploy) area - this area contains all artifacts needed to run 
 Nutch jobs. For a distributed setup that uses an existing Hadoop cluster 
 (installed from plain vanilla Hadoop release) this will be a {{/deploy}} 
 directory, where we put the following:
 {code}
 bin/nutch
 nutch.job
 {code}
 That's it - nothing else should be needed, because all other resources are 
 already included in the job jar. These resources can be copied directly to 
 the master Hadoop node.
 For a local setup (using LocalJobTracker) this will be a {{/runtime}} 
 directory, where we put the following:
 {code}
 bin/nutch
 lib/hadoop-libs
 plugins/
 nutch.job
 {code}
 Due to limitations in the PluginClassLoader the local runtime requires that 
 the plugins/ directory be unpacked from the job jar. And we need the hadoop 
 libs to run in the local mode. We may later on refine this local setup to 
 something like this:
 {code}
 bin/nutch
 conf/
 lib/hadoop-libs
 lib/nutch-libs
 plugins/
 nutch.jar
 {code}
 so that it's easier to modify the config without rebuilding the job jar 
 (which actually would not be used in this case).

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (NUTCH-843) Separate the build and runtime environments

2010-07-08 Thread Andrzej Bialecki (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-843?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12886330#action_12886330
 ] 

Andrzej Bialecki  commented on NUTCH-843:
-

Pseudo-distributed (i.e. a real JobTracker with a single TaskTracker) suffers 
from the same classpath issues that I described above, so even in such a case 
it's best to run jobs in a separate environment, using /runtime/deploy 
artifacts.

 Separate the build and runtime environments
 ---

 Key: NUTCH-843
 URL: https://issues.apache.org/jira/browse/NUTCH-843
 Project: Nutch
  Issue Type: Improvement
  Components: build
Affects Versions: 2.0
Reporter: Andrzej Bialecki 
Assignee: Andrzej Bialecki 
 Fix For: 2.0

 Attachments: NUTCH-843.patch, NUTCH-843.patch


 Currently there is no clean separation of source, build and runtime 
 artifacts. On one hand, it makes it easier to get started in local mode, but 
 on the other hand it makes the distributed (or pseudo-distributed) setup much 
 more challenging and tricky. Also, some resources (config files and classes) 
 are included several times on the classpath, they are loaded under different 
 classloaders, and in the end it's not obvious what copy and why takes 
 precedence.
 Here's an example of a harmful unintended behavior caused by this mess: 
 Hadoop daemons (jobtracker and tasktracker) will get conf/ and build/ on 
 their classpath. This means that a task running on this cluster will have two 
 copies of resources from these locations - one from the inherited classpath 
 from tasktracker, and the other one from the just unpacked nutch.job file. If 
 these two versions differ, only the first one will be loaded, which in this 
 case is the one taken from the (unpacked) conf/ and build/ - the other one, 
 from within the nutch.job file, will be ignored.
 It's even worse when you add more nodes to the cluster - the nutch.job will 
 be shipped to the new nodes as a part of each task setup, but now the remote 
 tasktracker child processes will use resources from nutch.job - so some tasks 
 will use different versions of resources than other tasks. This usually leads 
 to a host of very difficult to debug issues.
 This issue proposes then to separate these environments into the following 
 areas:
 * source area - i.e. our current sources. Note that bin/ scripts will belong 
 to this category too, so there will be no top-level bin/. nutch-default.xml 
 belongs to this category too. Other customizable files can be moved to 
 src/conf too, or they could stay in top-level conf/ as today, with a README 
 that explains that changes made there take effect only after you rebuild the 
 job jar.
 * build area - contains build artifacts, among them the nutch.job jar.
 * runtime (or deploy) area - this area contains all artifacts needed to run 
 Nutch jobs. For a distributed setup that uses an existing Hadoop cluster 
 (installed from plain vanilla Hadoop release) this will be a {{/deploy}} 
 directory, where we put the following:
 {code}
 bin/nutch
 nutch.job
 {code}
 That's it - nothing else should be needed, because all other resources are 
 already included in the job jar. These resources can be copied directly to 
 the master Hadoop node.
 For a local setup (using LocalJobTracker) this will be a {{/runtime}} 
 directory, where we put the following:
 {code}
 bin/nutch
 lib/hadoop-libs
 plugins/
 nutch.job
 {code}
 Due to limitations in the PluginClassLoader the local runtime requires that 
 the plugins/ directory be unpacked from the job jar. And we need the hadoop 
 libs to run in the local mode. We may later on refine this local setup to 
 something like this:
 {code}
 bin/nutch
 conf/
 lib/hadoop-libs
 lib/nutch-libs
 plugins/
 nutch.jar
 {code}
 so that it's easier to modify the config without rebuilding the job jar 
 (which actually would not be used in this case).

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Resolved: (NUTCH-845) Native hadoop libs not available through maven

2010-07-08 Thread Andrzej Bialecki (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-845?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrzej Bialecki  resolved NUTCH-845.
-

Fix Version/s: 2.0
   Resolution: Fixed

Committed in rev. 961778. Thanks for review!

 Native hadoop libs not available through maven
 --

 Key: NUTCH-845
 URL: https://issues.apache.org/jira/browse/NUTCH-845
 Project: Nutch
  Issue Type: Bug
  Components: build
Affects Versions: 2.0
Reporter: Andrzej Bialecki 
Assignee: Andrzej Bialecki 
 Fix For: 2.0


 There are no maven artifacts for the native libs (I verified this on Hadoop 
 ML). I think it's better to delete the libs, after all we don't want to keep 
 bits and pieces of dependencies in our svn, but let's leave a placeholder and 
 a README that explains how to get them.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (NUTCH-844) Improve NutchConfiguration

2010-07-14 Thread Andrzej Bialecki (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-844?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrzej Bialecki  updated NUTCH-844:


Attachment: NUTCH-844.patch

Updated patch. This also addresses an issue in PluginRepository that uses 
Configuration as a key in its internal cache - the problem though is that 
Configuration doesn't implement hashCode, so the cache would have been 
ineffective in situations like this:
{code}
Configuration conf = NutchConfiguration.create();
PluginRepository repo1 = PluginRepository.get(conf);
JobConf job = new NutchJob(conf);
PluginRepository repo2 = PluginRepository.get(job);
// repo2 is a new instance, but should be the same instance!
{code}

The new code sets a UUID property, so the cache knows it's still the same 
instance. There's a new unit test that ensures this works properly when using 
NutchConfiguration.create(), and illustrates that it fails without the uuid.
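
In rough terms (sketch only - the actual property name and cache wiring are in 
the patch):

{code}
// NutchConfiguration.create() stamps every new configuration with a unique id:
conf.set("nutch.conf.uuid", UUID.randomUUID().toString());

// PluginRepository.get(conf) then keys its cache on that id instead of on the
// Configuration instance, so a NutchJob copy maps to the same repository:
String uuid = conf.get("nutch.conf.uuid");
PluginRepository repo = CACHE.get(uuid);   // CACHE: a Map<String, PluginRepository>
if (repo == null) {
  repo = new PluginRepository(conf);
  CACHE.put(uuid, repo);
}
{code}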

 Improve NutchConfiguration
 --

 Key: NUTCH-844
 URL: https://issues.apache.org/jira/browse/NUTCH-844
 Project: Nutch
  Issue Type: Improvement
Affects Versions: 2.0
Reporter: Andrzej Bialecki 
Assignee: Andrzej Bialecki 
 Fix For: 2.0

 Attachments: conf.patch, NUTCH-844.patch


 This patch cleans up NutchConfiguration from servlet dependency, and modifies 
 the API to allow bootstrapping via API from Properties. This is important for 
 use cases where Nutch is embedded in a larger application.
 Also, while I'm at it, remove the support for alternative crawl 
 configuration when running Crawl tool, which has always been a source of 
 confusion.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Resolved: (NUTCH-844) Improve NutchConfiguration

2010-07-14 Thread Andrzej Bialecki (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-844?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrzej Bialecki  resolved NUTCH-844.
-

Resolution: Fixed

Committed in r964063. Thanks for review!

 Improve NutchConfiguration
 --

 Key: NUTCH-844
 URL: https://issues.apache.org/jira/browse/NUTCH-844
 Project: Nutch
  Issue Type: Improvement
Affects Versions: 2.0
Reporter: Andrzej Bialecki 
Assignee: Andrzej Bialecki 
 Fix For: 2.0

 Attachments: conf.patch, NUTCH-844.patch


 This patch cleans up NutchConfiguration from servlet dependency, and modifies 
 the API to allow bootstrapping via API from Properties. This is important for 
 use cases where Nutch is embedded in a larger application.
 Also, while I'm at it, remove the support for alternative crawl 
 configuration when running Crawl tool, which has always been a source of 
 confusion.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (NUTCH-858) No longer able to set per-field boosts on lucene documents

2010-07-21 Thread Andrzej Bialecki (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-858?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrzej Bialecki  updated NUTCH-858:


 Assignee: Andrzej Bialecki 
Fix Version/s: 1.2

 No longer able to set per-field boosts on lucene documents
 --

 Key: NUTCH-858
 URL: https://issues.apache.org/jira/browse/NUTCH-858
 Project: Nutch
  Issue Type: Bug
  Components: indexer
Affects Versions: 1.1
 Environment: n/a
Reporter: Edward Drapkin
Assignee: Andrzej Bialecki 
 Fix For: 1.2


 I'm working on upgrading from Nutch 0.9 to Nutch 1.1 and I've noticed that it 
 no longer seems possible to set boosts on specific fields in lucene 
 documents.  This is, in my opinion, a major feature regression and removes a 
 huge component to fine tuning search.  Can this be added?

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (NUTCH-858) No longer able to set per-field boosts on lucene documents

2010-07-21 Thread Andrzej Bialecki (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-858?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12890873#action_12890873
 ] 

Andrzej Bialecki  commented on NUTCH-858:
-

Unfortunately no. The patch was included in a fix to NUTCH-837, which is 
relative to trunk, and it's not directly applicable to 1.x, needs to be ported.

 No longer able to set per-field boosts on lucene documents
 --

 Key: NUTCH-858
 URL: https://issues.apache.org/jira/browse/NUTCH-858
 Project: Nutch
  Issue Type: Bug
  Components: indexer
Affects Versions: 1.1
 Environment: n/a
Reporter: Edward Drapkin
Assignee: Andrzej Bialecki 
 Fix For: 1.2


 I'm working on upgrading from Nutch 0.9 to Nutch 1.1 and I've noticed that it 
 no longer seems possible to set boosts on specific fields in lucene 
 documents.  This is, in my opinion, a major feature regression and removes a 
 huge component to fine tuning search.  Can this be added?

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Resolved: (NUTCH-863) Benchmark and a testbed proxy server

2010-07-30 Thread Andrzej Bialecki (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-863?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrzej Bialecki  resolved NUTCH-863.
-

Fix Version/s: 2.0
   Resolution: Fixed

Committed in rev. 980932.

 Benchmark and a testbed proxy server
 

 Key: NUTCH-863
 URL: https://issues.apache.org/jira/browse/NUTCH-863
 Project: Nutch
  Issue Type: New Feature
Affects Versions: 2.0
Reporter: Andrzej Bialecki 
Assignee: Andrzej Bialecki 
 Fix For: 2.0

 Attachments: proxy.patch


 This issue adds two components:
 * a testbed proxy server that can serve various content: pre-fetched Nutch 
 segments, forward requests to original URLs, or create a lot of unique but 
 predictable fake content (with outlinks) on the fly.
 * a simple Benchmark class to measure the time taken to complete several 
 crawl cycles using fake content.
 * 'ant proxy' and 'ant benchmark' targets to execute a benchmark run.
 Together these tools should provide a more or less objective method to 
 measure the end-to-end crawl performance. This initial version can be further 
 instrumented to collect statistics about various stages of data processing.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Created: (NUTCH-867) Port Nutch benchmark to Nutchbase

2010-07-31 Thread Andrzej Bialecki (JIRA)
Port Nutch benchmark to Nutchbase
-

 Key: NUTCH-867
 URL: https://issues.apache.org/jira/browse/NUTCH-867
 Project: Nutch
  Issue Type: New Feature
Affects Versions: nutchbase
Reporter: Andrzej Bialecki 
Assignee: Andrzej Bialecki 
 Fix For: nutchbase


Bring tools from NUTCH-863 to Nutchbase, and measure the performance of the 
Nutchbase branch vs. trunk.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (NUTCH-858) No longer able to set per-field boosts on lucene documents

2010-08-04 Thread Andrzej Bialecki (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-858?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12895377#action_12895377
 ] 

Andrzej Bialecki  commented on NUTCH-858:
-

It was r960064, but I have to admit I sneaked in this improvement as a part of 
NUTCH-837, which contained a lot of other stuff...

 No longer able to set per-field boosts on lucene documents
 --

 Key: NUTCH-858
 URL: https://issues.apache.org/jira/browse/NUTCH-858
 Project: Nutch
  Issue Type: Bug
  Components: indexer
Affects Versions: 1.1
 Environment: n/a
Reporter: Edward Drapkin
Assignee: Andrzej Bialecki 
 Fix For: 1.2


 I'm working on upgrading from Nutch 0.9 to Nutch 1.1 and I've noticed that it 
 no longer seems possible to set boosts on specific fields in lucene 
 documents.  This is, in my opinion, a major feature regression and removes a 
 huge component to fine tuning search.  Can this be added?

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (NUTCH-867) Port Nutch benchmark to Nutchbase

2010-08-04 Thread Andrzej Bialecki (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-867?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrzej Bialecki  updated NUTCH-867:


Attachment: benchmark.patch

Ported benchmark that uses HSQLDB as the store impl. If there are no objections 
I'll commit it shortly.

 Port Nutch benchmark to Nutchbase
 -

 Key: NUTCH-867
 URL: https://issues.apache.org/jira/browse/NUTCH-867
 Project: Nutch
  Issue Type: New Feature
Affects Versions: nutchbase
Reporter: Andrzej Bialecki 
Assignee: Andrzej Bialecki 
 Fix For: nutchbase

 Attachments: benchmark.patch


 Bring tools from NUTCH-863 to Nutchbase, and measure the performance of the 
 Nutchbase branch vs. trunk.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (NUTCH-876) Remove remaining robots/IP blocking code in lib-http

2010-08-09 Thread Andrzej Bialecki (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-876?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrzej Bialecki  updated NUTCH-876:


Attachment: NUTCH-876.patch

Patch to fix the issue. If there are no objections I'll commit this shortly.

 Remove remaining robots/IP blocking code in lib-http
 

 Key: NUTCH-876
 URL: https://issues.apache.org/jira/browse/NUTCH-876
 Project: Nutch
  Issue Type: Bug
  Components: fetcher
Affects Versions: 2.0
Reporter: Andrzej Bialecki 
Assignee: Andrzej Bialecki 
 Attachments: NUTCH-876.patch


 There are remains of the (very old) blocking code in 
 lib-http/.../HttpBase.java. This code was used with the OldFetcher to manage 
 politeness limits. New trunk doesn't have OldFetcher anymore, so this code is 
 useless. Furthermore, there is an actual bug here - FetcherJob forgets to set 
 Protocol.CHECK_BLOCKING and Protocol.CHECK_ROBOTS to false, and the defaults 
 in lib-http are set to true.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Created: (NUTCH-879) URL-s getting lost

2010-08-10 Thread Andrzej Bialecki (JIRA)
URL-s getting lost
--

 Key: NUTCH-879
 URL: https://issues.apache.org/jira/browse/NUTCH-879
 Project: Nutch
  Issue Type: Bug
Affects Versions: 2.0
 Environment: * Ubuntu 10.4 x64, Sun JDK 1.6
* using 1-node Hadoop + HDFS
* trunk r983472, using MySQL store
* branch-1.3
Reporter: Andrzej Bialecki 


I ran the Benchmark using branch-1.3 and trunk (formerly nutchbase). With the 
same Benchmark parameters and the same plugins branch-1.3 collects ~1.5mln 
urls, while trunk collects ~20,000 urls. Clearly something is wrong.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (NUTCH-880) REST API (and webapp) for Nutch

2010-08-11 Thread Andrzej Bialecki (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-880?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrzej Bialecki  updated NUTCH-880:


Description: 
This issue is for discussing a REST-style API for accessing Nutch.

Here's an initial idea:

* I propose to use org.restlet for handling requests and returning 
JSON/XML/whatever responses.
* hook up all regular tools so that they can be driven via this API. This would 
have to be an async API, since all Nutch operations take a long time to execute. 
It follows then that we need to be able also to list running operations, 
retrieve their current status, and possibly 
abort/cancel/stop/suspend/resume/...? This also means that we would have to 
potentially create & manage many threads in a servlet - AFAIK this is frowned 
upon by J2EE purists...
* package this in a webapp (that includes all deps, essentially nutch.job 
content), with the restlet servlet as an entry point.

Open issues:

* how to implement the reading of crawl results via this API
* should we manage only crawls that use a single configuration per webapp, or 
should we have a notion of crawl contexts (sets of crawl configs) with CRUD ops 
on them? this would be nice, because it would allow managing of several 
different crawls, with different configs, in a single webapp - but it 
complicates the implementation a lot.

  was:
This issue is for discussing a REST-style API for accessing Nutch.

Here's an initial idea:

* I propose to use org.restlet for handling JSON requests
* hook up all regular tools so that they can be driven via this API. This would 
have to be an async API, since all Nutch operations take a long time to execute. 
It follows that we also need to be able to list running operations, 
retrieve their current status, and possibly 
abort/cancel/stop/suspend/resume/...? This also means that we would have to 
potentially create and manage many threads in a servlet - AFAIK this is frowned 
upon by J2EE purists...
* package this in a webapp (that includes all deps, essentially nutch.job 
content), with the restlet servlet as an entry point.

Open issues:

* how to implement the reading of crawl results via this API
* should we manage only crawls that use a single configuration per webapp, or 
should we have a notion of crawl contexts (sets of crawl configs) with CRUD ops 
on them? this would be nice, because it would allow managing of several 
different crawls, with different configs, in a single webapp - but it 
complicates the implementation a lot.


 REST API (and webapp) for Nutch
 ---

 Key: NUTCH-880
 URL: https://issues.apache.org/jira/browse/NUTCH-880
 Project: Nutch
  Issue Type: New Feature
Affects Versions: 2.0
Reporter: Andrzej Bialecki 

 This issue is for discussing a REST-style API for accessing Nutch.
 Here's an initial idea:
 * I propose to use org.restlet for handling requests and returning 
 JSON/XML/whatever responses.
 * hook up all regular tools so that they can be driven via this API. This 
 would have to be an async API, since all Nutch operations take a long time to 
 execute. It follows that we also need to be able to list running 
 operations, retrieve their current status, and possibly 
 abort/cancel/stop/suspend/resume/...? This also means that we would have to 
 potentially create and manage many threads in a servlet - AFAIK this is frowned 
 upon by J2EE purists...
 * package this in a webapp (that includes all deps, essentially nutch.job 
 content), with the restlet servlet as an entry point.
 Open issues:
 * how to implement the reading of crawl results via this API
 * should we manage only crawls that use a single configuration per webapp, or 
 should we have a notion of crawl contexts (sets of crawl configs) with CRUD 
 ops on them? this would be nice, because it would allow managing of several 
 different crawls, with different configs, in a single webapp - but it 
 complicates the implementation a lot.
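
To make the restlet part more concrete, a rough sketch of what a resource could look like (this is not the attached patch; the resource name, path and port are made up, and a real implementation would report the running asynchronous jobs):

{code}
import org.restlet.Component;
import org.restlet.data.Protocol;
import org.restlet.resource.Get;
import org.restlet.resource.ServerResource;

// Hypothetical resource that lists running Nutch jobs as JSON.
public class JobListResource extends ServerResource {
  @Get("json")
  public String listJobs() {
    // Placeholder - a real implementation would serialize the running jobs.
    return "{\"jobs\":[]}";
  }

  public static void main(String[] args) throws Exception {
    Component component = new Component();
    component.getServers().add(Protocol.HTTP, 8081);
    component.getDefaultHost().attach("/api/jobs", JobListResource.class);
    component.start();
  }
}
{code}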

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Created: (NUTCH-884) FetcherJob should run more reduce tasks than default

2010-08-11 Thread Andrzej Bialecki (JIRA)
FetcherJob should run more reduce tasks than default


 Key: NUTCH-884
 URL: https://issues.apache.org/jira/browse/NUTCH-884
 Project: Nutch
  Issue Type: Improvement
  Components: fetcher
Affects Versions: 2.0
Reporter: Andrzej Bialecki 
Assignee: Andrzej Bialecki 
 Fix For: 2.0


FetcherJob now performs fetching in the reduce phase. This means that in a 
typical Hadoop setup there will be many fewer reduce tasks than map tasks, and 
consequently the max. total throughput of Fetcher will be proportionally 
reduced. I propose that FetcherJob should set the number of reduce tasks to the 
number of map tasks. This way the fetching will be more granular.
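
A minimal sketch of what the proposed change amounts to (illustrative only, not the actual patch):

{code}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

// Illustrative only: make the fetch job use as many reducers as mappers,
// so per-host fetch queues are spread over more tasks.
public class FetchJobSetup {
  public static Job createFetchJob(Configuration conf, int numMapTasks) throws Exception {
    Job job = new Job(conf, "fetch");
    job.setNumReduceTasks(numMapTasks);
    return job;
  }
}
{code}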

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Resolved: (NUTCH-872) Change the default fetcher.parse to FALSE

2010-08-11 Thread Andrzej Bialecki (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-872?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrzej Bialecki  resolved NUTCH-872.
-

Fix Version/s: 2.0
   Resolution: Fixed

I changed the name of the option to -parse to be consistent with the 
nutch-default.xml naming. I also updated the API to use this name; it's less 
confusing this way.

Committed in rev. 984401. Thanks for the feedback.

 Change the default fetcher.parse to FALSE
 -

 Key: NUTCH-872
 URL: https://issues.apache.org/jira/browse/NUTCH-872
 Project: Nutch
  Issue Type: Improvement
Affects Versions: 1.2, 2.0
Reporter: Andrzej Bialecki 
Assignee: Andrzej Bialecki 
 Fix For: 2.0


 I propose to change this property to false. The reason is that it's a safer 
 default - parsing issues don't lead to a loss of the downloaded content. For 
 larger crawls this is the recommended way to run Fetcher. Users that run 
 smaller crawls can still override it.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (NUTCH-884) FetcherJob should run more reduce tasks than default

2010-08-11 Thread Andrzej Bialecki (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-884?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrzej Bialecki  updated NUTCH-884:


Attachment: NUTCH-884.patch

Patch with the change. I also rearranged the arguments to FetcherJob.fetch(..) 
to make more sense (IMHO).

 FetcherJob should run more reduce tasks than default
 

 Key: NUTCH-884
 URL: https://issues.apache.org/jira/browse/NUTCH-884
 Project: Nutch
  Issue Type: Improvement
  Components: fetcher
Affects Versions: 2.0
Reporter: Andrzej Bialecki 
Assignee: Andrzej Bialecki 
 Fix For: 2.0

 Attachments: NUTCH-884.patch


 FetcherJob now performs fetching in the reduce phase. This means that in a 
 typical Hadoop setup there will be many fewer reduce tasks than map tasks, 
 and consequently the max. total throughput of Fetcher will be proportionally 
 reduced. I propose that FetcherJob should set the number of reduce tasks to 
 the number of map tasks. This way the fetching will be more granular.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (NUTCH-882) Design a Host table in GORA

2010-08-18 Thread Andrzej Bialecki (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-882?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12899810#action_12899810
 ] 

Andrzej Bialecki  commented on NUTCH-882:
-

This functionality is very useful for larger crawls. Some comments about the 
design:

* the table can be populated by injection, as in the patch, or from webtable. 
Since keys are from different spaces (url-s vs. hosts) I think it would be very 
tricky to try to do this on the fly in one of the existing jobs... so this 
means an additional step in the workflow.

* I'm worried about the scalability of the approach taken by HostMDApplierJob - 
per-host data will be multiplied by the number of urls from a host and put into 
webtable, which will in turn balloon the size of webtable...

A little background: what we see here is a design issue typical for mapreduce, 
where you have to merge data keyed by keys from different spaces (with 
different granularity). Possible solutions involve:
* first converting the data to a common key space and then submitting both 
datasets as mapreduce inputs, or
* submitting only the finer-grained input to mapreduce and dynamically 
converting the keys on the fly (and reading data directly from the 
coarser-grained source, accessing it randomly).

A similar situation is described in HADOOP-3063 together with a solution, 
namely, to use random access and use Bloom filters to quickly discover missing 
keys.

So I propose that instead of statically merging the data (HostMDApplierJob) we 
could merge it dynamically on the fly, by implementing a high-performance 
reader of the host table and then using this reader directly in the context of 
map()/reduce() tasks as needed. This reader should use a Bloom filter to 
quickly determine nonexistent keys, and it may use a limited amount of 
in-memory cache for existing records. The Bloom filter data should be 
re-computed on updates and stored/retrieved, to avoid lengthy initialization.

The cost of using this approach is IMHO much smaller than the cost of 
statically joining this data. The static join costs both space and time to 
execute an additional job. Let's consider the dynamic join cost, e.g. in 
Fetcher - HostDBReader would be used only when initializing host queues, so the 
number of IO-s would be at most the number of unique hosts on the fetchlist (at 
most, because some of the host data may be missing - here the Bloom filter comes 
to the rescue to quickly discover this without doing any IO). During updatedb we 
would likely want to access this data in DbUpdateReducer. Keys are URLs here, and 
they are ordered in ascending order - but they are in host-reversed format, 
which means that URLs from similar hosts and domains are close together. This 
is beneficial, because when we read data from HostDBReader we will read records 
that are close together, thus avoiding seeks. We can also cache the retrieved 
per-host data in DbUpdateReducer.
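
To illustrate the dynamic join, a rough sketch of such a reader (class and method names are made up; the real thing would read from the Gora host table):

{code}
import java.util.HashMap;
import java.util.Map;
import org.apache.hadoop.util.bloom.BloomFilter;
import org.apache.hadoop.util.bloom.Key;

// Illustrative reader: the Bloom filter is pre-computed over all host keys and
// loaded at startup; lookupInStore() stands in for a random-access store read.
public class HostDbReaderSketch {
  private final BloomFilter filter;
  private final Map<String, String> cache = new HashMap<String, String>();

  public HostDbReaderSketch(BloomFilter precomputedFilter) {
    this.filter = precomputedFilter;
  }

  /** Returns per-host metadata, doing no IO for hosts that are certainly absent. */
  public String getHostMeta(String host) {
    if (!filter.membershipTest(new Key(host.getBytes()))) {
      return null;                        // definitely not in the host table
    }
    String meta = cache.get(host);
    if (meta == null) {
      meta = lookupInStore(host);         // random access into the host table
      if (meta != null) {
        cache.put(host, meta);
      }
    }
    return meta;
  }

  private String lookupInStore(String host) {
    return null;                          // placeholder for a DataStore.get() call
  }
}
{code}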

 Design a Host table in GORA
 ---

 Key: NUTCH-882
 URL: https://issues.apache.org/jira/browse/NUTCH-882
 Project: Nutch
  Issue Type: New Feature
Affects Versions: 2.0
Reporter: Julien Nioche
Assignee: Julien Nioche
 Fix For: 2.0

 Attachments: NUTCH-882-v1.patch


 Having a separate GORA table for storing information about hosts (and 
 domains?) would be very useful for : 
 * customising the behaviour of the fetching on a host basis e.g. number of 
 threads, min time between threads etc...
 * storing stats
 * keeping metadata and possibly propagate them to the webpages 
 * keeping a copy of the robots.txt and possibly use that later to filter the 
 webtable
 * store sitemaps files and update the webtable accordingly
 I'll try to come up with a GORA schema for such a host table but any comments 
 are of course already welcome 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Created: (NUTCH-891) Nutch build should not depend on unversioned local deps

2010-08-19 Thread Andrzej Bialecki (JIRA)
Nutch build should not depend on unversioned local deps
---

 Key: NUTCH-891
 URL: https://issues.apache.org/jira/browse/NUTCH-891
 Project: Nutch
  Issue Type: Bug
Reporter: Andrzej Bialecki 


The fix in NUTCH-873 introduces an unknown variable to the build process. Since 
local ivy artifacts are unversioned, different people that install Gora jars at 
different points in time will use the same artifact id but in fact the 
artifacts (jars) will differ because they will come from different revisions of 
Gora sources. Therefore Nutch builds based on the same svn rev. won't be 
repeatable across different environments.

As much as it pains the ivy purists ;) until Gora publishes versioned artifacts 
I'd like to revert the fix in NUTCH-873 and again add Gora jars built from a 
known external rev. We can add a README that records the commit id from Gora.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (NUTCH-891) Nutch build should not depend on unversioned local deps

2010-08-19 Thread Andrzej Bialecki (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-891?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12900455#action_12900455
 ] 

Andrzej Bialecki  commented on NUTCH-891:
-

Yes, this would help.

 Nutch build should not depend on unversioned local deps
 ---

 Key: NUTCH-891
 URL: https://issues.apache.org/jira/browse/NUTCH-891
 Project: Nutch
  Issue Type: Bug
Reporter: Andrzej Bialecki 

 The fix in NUTCH-873 introduces an unknown variable to the build process. 
 Since local ivy artifacts are unversioned, different people that install Gora 
 jars at different points in time will use the same artifact id but in fact 
 the artifacts (jars) will differ because they will come from different 
 revisions of Gora sources. Therefore Nutch builds based on the same svn rev. 
 won't be repeatable across different environments.
 As much as it pains the ivy purists ;) until Gora publishes versioned 
 artifacts I'd like to revert the fix in NUTCH-873 and again add Gora jars 
 built from a known external rev. We can add a README that records the commit 
 id from Gora.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Created: (NUTCH-893) DataStore.put() silently loses records when executed from multiple processes

2010-08-25 Thread Andrzej Bialecki (JIRA)
DataStore.put() silently loses records when executed from multiple processes


 Key: NUTCH-893
 URL: https://issues.apache.org/jira/browse/NUTCH-893
 Project: Nutch
  Issue Type: Bug
 Environment: Gora HEAD, SqlStore, MySQL 5.1, Ubuntu 10.4 x64, Sun JDK 
1.6
Reporter: Andrzej Bialecki 


In order to debug the issue described in NUTCH-879 I created a test to simulate 
multiple clients appending to webtable (please see the patch), which is the 
situation that we have in distributed map-reduce jobs.

There are two tests there: one that uses multiple threads within the same JVM, 
and another that uses a single thread in multiple JVMs. Each test first clears 
webtable (be careful!), then puts a bunch of pages, and finally verifies that 
all are present and that their values correspond to their keys. To make things 
more interesting, each execution context (thread or process) closes and reopens 
its instance of DataStore a few times.

The multithreaded test passes just fine. However, the multi-process test fails 
with missing keys, as many as 30%.
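
For reference, the shape of the test is roughly as follows (a simplified sketch, not the attached patch; the KeyValueStore interface below only stands in for Gora's DataStore, so no exact Gora signatures are assumed):

{code}
// KeyValueStore stands in for DataStore<String, WebPage>.
public class MultiProcessPutSketch {

  interface KeyValueStore {
    void put(String key, String value);
    void flush();
    void close();
    String get(String key);
  }

  // Run once per process: write a disjoint key range, flushing periodically.
  static void writerProcess(KeyValueStore store, int processId, int count) {
    for (int i = 0; i < count; i++) {
      String key = "http://host" + processId + ".example.com/page" + i;
      store.put(key, key);               // value mirrors the key for easy checking
      if (i % 100 == 0) {
        store.flush();
      }
    }
    store.close();
  }

  // Run after all writers have finished: count keys that did not survive.
  static int countMissing(KeyValueStore store, int processes, int count) {
    int missing = 0;
    for (int p = 0; p < processes; p++) {
      for (int i = 0; i < count; i++) {
        String key = "http://host" + p + ".example.com/page" + i;
        if (!key.equals(store.get(key))) {
          missing++;                     // in the failing runs this reaches ~30%
        }
      }
    }
    return missing;
  }
}
{code}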

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (NUTCH-893) DataStore.put() silently loses records when executed from multiple processes

2010-08-25 Thread Andrzej Bialecki (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-893?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrzej Bialecki  updated NUTCH-893:


Attachment: NUTCH-893.patch

Unit test to illustrate the issue.

 DataStore.put() silently loses records when executed from multiple processes
 

 Key: NUTCH-893
 URL: https://issues.apache.org/jira/browse/NUTCH-893
 Project: Nutch
  Issue Type: Bug
Affects Versions: 2.0
 Environment: Gora HEAD, SqlStore, MySQL 5.1, Ubuntu 10.4 x64, Sun JDK 
 1.6
Reporter: Andrzej Bialecki 
 Attachments: NUTCH-893.patch


 In order to debug the issue described in NUTCH-879 I created a test to 
 simulate multiple clients appending to webtable (please see the patch), which 
 is the situation that we have in distributed map-reduce jobs.
 There are two tests there: one that uses multiple threads within the same 
 JVM, and another that uses a single thread in multiple JVMs. Each test first 
 clears webtable (be careful!), then puts a bunch of pages, and finally 
 verifies that all are present and that their values correspond to their keys. 
 To make things more interesting, each execution context (thread or process) 
 closes and reopens its instance of DataStore a few times.
 The multithreaded test passes just fine. However, the multi-process test 
 fails with missing keys, as many as 30%.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (NUTCH-893) DataStore.put() silently loses records when executed from multiple processes

2010-08-30 Thread Andrzej Bialecki (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-893?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12904226#action_12904226
 ] 

Andrzej Bialecki  commented on NUTCH-893:
-

Dogacan, flush() doesn't help - there are still missing keys. What's 
interesting is that the missing keys form sequential ranges. Could this be 
perhaps an issue with connection management, or some synchronization issue?

 DataStore.put() silently loses records when executed from multiple processes
 

 Key: NUTCH-893
 URL: https://issues.apache.org/jira/browse/NUTCH-893
 Project: Nutch
  Issue Type: Bug
Affects Versions: 2.0
 Environment: Gora HEAD, SqlStore, MySQL 5.1, Ubuntu 10.4 x64, Sun JDK 
 1.6
Reporter: Andrzej Bialecki 
Priority: Blocker
 Fix For: 2.0

 Attachments: NUTCH-893.patch


 In order to debug the issue described in NUTCH-879 I created a test to 
 simulate multiple clients appending to webtable (please see the patch), which 
 is the situation that we have in distributed map-reduce jobs.
 There are two tests there: one that uses multiple threads within the same 
 JVM, and another that uses a single thread in multiple JVMs. Each test first 
 clears webtable (be careful!), then puts a bunch of pages, and finally 
 verifies that all are present and that their values correspond to their keys. 
 To make things more interesting, each execution context (thread or process) 
 closes and reopens its instance of DataStore a few times.
 The multithreaded test passes just fine. However, the multi-process test 
 fails with missing keys, as many as 30%.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (NUTCH-893) DataStore.put() silently loses records when executed from multiple processes

2010-09-08 Thread Andrzej Bialecki (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-893?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12907297#action_12907297
 ] 

Andrzej Bialecki  commented on NUTCH-893:
-

Very good catch - yes, the test now passes for me too. This is actually good 
news for Gora :) I'll continue digging into NUTCH-879 ... don't hesitate to 
chime in if you have any ideas on how to solve that. I suspect we may be losing 
keys in the Generator or Fetcher due to partitioning collisions, but this 
hypothesis needs to be tested.

 DataStore.put() silently loses records when executed from multiple processes
 

 Key: NUTCH-893
 URL: https://issues.apache.org/jira/browse/NUTCH-893
 Project: Nutch
  Issue Type: Bug
Affects Versions: 2.0
 Environment: Gora HEAD, SqlStore, MySQL 5.1, Ubuntu 10.4 x64, Sun JDK 
 1.6
Reporter: Andrzej Bialecki 
Priority: Blocker
 Fix For: 2.0

 Attachments: NUTCH-893.patch, NUTCH-893_v2.patch


 In order to debug the issue described in NUTCH-879 I created a test to 
 simulate multiple clients appending to webtable (please see the patch), which 
 is the situation that we have in distributed map-reduce jobs.
 There are two tests there: one that uses multiple threads within the same 
 JVM, and another that uses a single thread in multiple JVMs. Each test first 
 clears webtable (be careful!), then puts a bunch of pages, and finally 
 verifies that all are present and that their values correspond to their keys. 
 To make things more interesting, each execution context (thread or process) 
 closes and reopens its instance of DataStore a few times.
 The multithreaded test passes just fine. However, the multi-process test 
 fails with missing keys, as many as 30%.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (NUTCH-893) DataStore.put() silently loses records when executed from multiple processes

2010-09-13 Thread Andrzej Bialecki (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-893?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12908791#action_12908791
 ] 

Andrzej Bialecki  commented on NUTCH-893:
-

+1 and +1.

 DataStore.put() silently loses records when executed from multiple processes
 

 Key: NUTCH-893
 URL: https://issues.apache.org/jira/browse/NUTCH-893
 Project: Nutch
  Issue Type: Bug
Affects Versions: 2.0
 Environment: Gora HEAD, SqlStore, MySQL 5.1, Ubuntu 10.4 x64, Sun JDK 
 1.6
Reporter: Andrzej Bialecki 
Priority: Blocker
 Fix For: 2.0

 Attachments: NUTCH-893.patch, NUTCH-893_v2.patch


 In order to debug the issue described in NUTCH-879 I created a test to 
 simulate multiple clients appending to webtable (please see the patch), which 
 is the situation that we have in distributed map-reduce jobs.
 There are two tests there: one that uses multiple threads within the same 
 JVM, and another that uses a single thread in multiple JVMs. Each test first 
 clears webtable (be careful!), then puts a bunch of pages, and finally 
 verifies that all are present and that their values correspond to their keys. 
 To make things more interesting, each execution context (thread or process) 
 closes and reopens its instance of DataStore a few times.
 The multithreaded test passes just fine. However, the multi-process test 
 fails with missing keys, as many as 30%.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Created: (NUTCH-907) DataStore API doesn't support multiple storage areas for multiple disjoint crawls

2010-09-15 Thread Andrzej Bialecki (JIRA)
DataStore API doesn't support multiple storage areas for multiple disjoint 
crawls
-

 Key: NUTCH-907
 URL: https://issues.apache.org/jira/browse/NUTCH-907
 Project: Nutch
  Issue Type: Bug
Reporter: Andrzej Bialecki 
 Fix For: 2.0


In Nutch 1.x it was possible to easily select a set of crawl data (crawldb, 
page data, linkdb, etc) by specifying a path where the data was stored. This 
enabled users to run several disjoint crawls with different configs, but still 
using the same storage medium, just under different paths.

This is not possible now because there is a 1:1 mapping between a specific 
DataStore instance and a set of crawl data.

In order to support this functionality the Gora API should be extended so that 
it can create stores (and data tables in the underlying storage) that use 
arbitrary prefixes to identify the particular crawl dataset. Then the Nutch API 
should be extended to allow passing this crawlId value to select one of 
possibly many existing crawl datasets.
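
As a trivial illustration of what such prefixes could look like on the Nutch side (a hypothetical helper, not an existing API):

{code}
// Hypothetical helper: derive a per-crawl schema/table name from a crawlId,
// so that disjoint crawls never share the same webtable.
public class CrawlDatasetNaming {
  static final String DEFAULT_SCHEMA = "webpage";

  public static String schemaNameFor(String crawlId) {
    if (crawlId == null || crawlId.length() == 0) {
      return DEFAULT_SCHEMA;                  // e.g. plain "webpage"
    }
    return crawlId + "_" + DEFAULT_SCHEMA;    // e.g. "testcrawl_webpage"
  }
}
{code}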

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (NUTCH-882) Design a Host table in GORA

2010-09-15 Thread Andrzej Bialecki (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-882?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12909757#action_12909757
 ] 

Andrzej Bialecki  commented on NUTCH-882:
-

+1 to NutchContext. See also NUTCH-907 because the changes required in Gora API 
will likely make this task easier (once implemented ;) ).

 Design a Host table in GORA
 ---

 Key: NUTCH-882
 URL: https://issues.apache.org/jira/browse/NUTCH-882
 Project: Nutch
  Issue Type: New Feature
Affects Versions: 2.0
Reporter: Julien Nioche
Assignee: Julien Nioche
 Fix For: 2.0

 Attachments: NUTCH-882-v1.patch


 Having a separate GORA table for storing information about hosts (and 
 domains?) would be very useful for : 
 * customising the behaviour of the fetching on a host basis e.g. number of 
 threads, min time between threads etc...
 * storing stats
 * keeping metadata and possibly propagate them to the webpages 
 * keeping a copy of the robots.txt and possibly use that later to filter the 
 webtable
 * store sitemaps files and update the webtable accordingly
 I'll try to come up with a GORA schema for such a host table but any comments 
 are of course already welcome 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (NUTCH-907) DataStore API doesn't support multiple storage areas for multiple disjoint crawls

2010-09-16 Thread Andrzej Bialecki (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-907?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12910109#action_12910109
 ] 

Andrzej Bialecki  commented on NUTCH-907:
-

That's very good news - in that case I'm fine with the Gora API as it is now; 
we should change Nutch to make use of this functionality.

 DataStore API doesn't support multiple storage areas for multiple disjoint 
 crawls
 -

 Key: NUTCH-907
 URL: https://issues.apache.org/jira/browse/NUTCH-907
 Project: Nutch
  Issue Type: Bug
Reporter: Andrzej Bialecki 
 Fix For: 2.0


 In Nutch 1.x it was possible to easily select a set of crawl data (crawldb, 
 page data, linkdb, etc) by specifying a path where the data was stored. This 
 enabled users to run several disjoint crawls with different configs, but 
 still using the same storage medium, just under different paths.
 This is not possible now because there is a 1:1 mapping between a specific 
 DataStore instance and a set of crawl data.
 In order to support this functionality the Gora API should be extended so 
 that it can create stores (and data tables in the underlying storage) that 
 use arbitrary prefixes to identify the particular crawl dataset. Then the 
 Nutch API should be extended to allow passing this crawlId value to select 
 one of possibly many existing crawl datasets.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (NUTCH-880) REST API (and webapp) for Nutch

2010-09-16 Thread Andrzej Bialecki (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-880?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrzej Bialecki  updated NUTCH-880:


Attachment: API.patch

Initial patch for discussion. This is a work in progress, so only some 
functionality is implemented, and even less than that is actually working ;)

I would appreciate a review and comments.

 REST API (and webapp) for Nutch
 ---

 Key: NUTCH-880
 URL: https://issues.apache.org/jira/browse/NUTCH-880
 Project: Nutch
  Issue Type: New Feature
Affects Versions: 2.0
Reporter: Andrzej Bialecki 
Assignee: Andrzej Bialecki 
 Attachments: API.patch


 This issue is for discussing a REST-style API for accessing Nutch.
 Here's an initial idea:
 * I propose to use org.restlet for handling requests and returning 
 JSON/XML/whatever responses.
 * hook up all regular tools so that they can be driven via this API. This 
 would have to be an async API, since all Nutch operations take a long time to 
 execute. It follows that we also need to be able to list running 
 operations, retrieve their current status, and possibly 
 abort/cancel/stop/suspend/resume/...? This also means that we would have to 
 potentially create and manage many threads in a servlet - AFAIK this is frowned 
 upon by J2EE purists...
 * package this in a webapp (that includes all deps, essentially nutch.job 
 content), with the restlet servlet as an entry point.
 Open issues:
 * how to implement the reading of crawl results via this API
 * should we manage only crawls that use a single configuration per webapp, or 
 should we have a notion of crawl contexts (sets of crawl configs) with CRUD 
 ops on them? this would be nice, because it would allow managing of several 
 different crawls, with different configs, in a single webapp - but it 
 complicates the implementation a lot.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Assigned: (NUTCH-862) HttpClient null pointer exception

2010-09-17 Thread Andrzej Bialecki (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-862?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrzej Bialecki  reassigned NUTCH-862:
---

Assignee: Andrzej Bialecki 

 HttpClient null pointer exception
 -

 Key: NUTCH-862
 URL: https://issues.apache.org/jira/browse/NUTCH-862
 Project: Nutch
  Issue Type: Bug
  Components: fetcher
Affects Versions: 1.0.0
 Environment: linux, java 6
Reporter: Sebastian Nagel
Assignee: Andrzej Bialecki 
Priority: Minor
 Attachments: NUTCH-862.patch


 When re-fetching a document (a continued crawl) HttpClient throws a null 
 pointer exception, causing the document to be emptied:
 2010-07-27 12:45:09,199 INFO  fetcher.Fetcher - fetching 
 http://localhost/doc/selfhtml/html/index.htm
 2010-07-27 12:45:09,203 ERROR httpclient.Http - java.lang.NullPointerException
 2010-07-27 12:45:09,204 ERROR httpclient.Http - at 
 org.apache.nutch.protocol.httpclient.HttpResponse.<init>(HttpResponse.java:138)
 2010-07-27 12:45:09,204 ERROR httpclient.Http - at 
 org.apache.nutch.protocol.httpclient.Http.getResponse(Http.java:154)
 2010-07-27 12:45:09,204 ERROR httpclient.Http - at 
 org.apache.nutch.protocol.http.api.HttpBase.getProtocolOutput(HttpBase.java:220)
 2010-07-27 12:45:09,204 ERROR httpclient.Http - at 
 org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:537)
 2010-07-27 12:45:09,204 INFO  fetcher.Fetcher - fetch of 
 http://localhost/doc/selfhtml/html/index.htm failed with: 
 java.lang.NullPointerException
 Because the document is re-fetched the server answers 304 (not modified):
 127.0.0.1 - - [27/Jul/2010:12:45:09 +0200] GET /doc/selfhtml/html/index.htm 
 HTTP/1.0 304 174 - Nutch-1.0
 No content is sent in this case (empty http body).
 Index: 
 trunk/src/plugin/protocol-httpclient/src/java/org/apache/nutch/protocol/httpclient/HttpResponse.java
 ===
 --- 
 trunk/src/plugin/protocol-httpclient/src/java/org/apache/nutch/protocol/httpclient/HttpResponse.java
 (revision 979647)
 +++ 
 trunk/src/plugin/protocol-httpclient/src/java/org/apache/nutch/protocol/httpclient/HttpResponse.java
 (working copy)
 @@ -134,7 +134,8 @@
  if (code == 200) throw new IOException(e.toString());
  // for codes other than 200 OK, we are fine with empty content
} finally {
 -in.close();
 +if (in != null)
 +  in.close();
  get.abort();
}

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Resolved: (NUTCH-906) Nutch OpenSearch sometimes raises DOMExceptions due to Lucene column names not being valid XML tag names

2010-09-17 Thread Andrzej Bialecki (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-906?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrzej Bialecki  resolved NUTCH-906.
-

Fix Version/s: 1.2
   Resolution: Fixed

Fixed in rev. 998261. Thanks!

 Nutch OpenSearch sometimes raises DOMExceptions due to Lucene column names 
 not being valid XML tag names
 

 Key: NUTCH-906
 URL: https://issues.apache.org/jira/browse/NUTCH-906
 Project: Nutch
  Issue Type: Bug
  Components: web gui
Affects Versions: 1.1
 Environment: Debian GNU/Linux 64-bit
Reporter: Asheesh Laroia
Assignee: Andrzej Bialecki 
 Fix For: 1.2

 Attachments: 
 0001-OpenSearch-If-a-Lucene-column-name-begins-with-a-num.patch

   Original Estimate: 0.33h
  Remaining Estimate: 0.33h

 The Nutch FAQ explains that OpenSearch includes all fields that are 
 available at search result time. However, some Lucene column names can start 
 with numbers. Valid XML tags cannot. If Nutch is generating OpenSearch 
 results for a document with a Lucene document column whose name starts with 
 numbers, the underlying Xerces library throws this exception: 
 org.w3c.dom.DOMException: INVALID_CHARACTER_ERR: An invalid or illegal XML 
 character is specified. 
 So I have written a patch that tests strings before they are used to generate 
 tags within OpenSearch.
 I hope you merge this, or a better version of the patch!
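
For reference, the kind of guard involved looks roughly like this (a sketch under assumed names, not the attached patch; a complete check would also validate the remaining characters of the name):

{code}
// Illustrative guard: make sure a field name can be used as an XML element
// name before the OpenSearch DOM element is created.
public class XmlTagNames {
  public static String toXmlTagName(String fieldName) {
    if (fieldName == null || fieldName.length() == 0) {
      return "field";
    }
    char first = fieldName.charAt(0);
    if (!Character.isLetter(first) && first != '_') {
      // e.g. "4xx_count" becomes "field_4xx_count"
      return "field_" + fieldName;
    }
    return fieldName;
  }
}
{code}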

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (NUTCH-909) Add alternative search-provider to Nutch site

2010-09-20 Thread Andrzej Bialecki (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-909?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12912474#action_12912474
 ] 

Andrzej Bialecki  commented on NUTCH-909:
-

bq. It might be better to see the message Search with Apache Solr (as on the 
TIKA's site).

Yes, let's make this uniform.

 Add alternative search-provider to Nutch site
 -

 Key: NUTCH-909
 URL: https://issues.apache.org/jira/browse/NUTCH-909
 Project: Nutch
  Issue Type: Improvement
  Components: documentation
Reporter: Alex Baranau
Priority: Minor
 Attachments: NUTCH-909.patch


 Add an additional search provider (besides the existing Lucid Find): search-lucene.com. 
 Initiated in discussion: http://search-lucene.com/m/2suCr1UnDfF1
 According to Andrzej's suggestion, when preparing the patch let's follow the 
 same rationales as those in TIKA-488, since they are applicable here too, so 
 please refer to that issue for more insight on implementation details.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (NUTCH-880) REST API (and webapp) for Nutch

2010-09-21 Thread Andrzej Bialecki (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-880?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12913118#action_12913118
 ] 

Andrzej Bialecki  commented on NUTCH-880:
-

bq. I think we can combine the approach you outlined in NUTCH-907 with this one.

I'm not sure... they are really not the same thing - you can execute many 
crawls with different seed lists while still using the same Configuration.

bq. What is CLASS ?

It's the same as bin/nutch fully.qualified.class.name, only here I require that 
it implements NutchTool.

bq. Btw, Andrzej, I will be happy to help out with the implementation if you 
want.

By all means - I haven't had time so far to progress beyond this patch...

 REST API (and webapp) for Nutch
 ---

 Key: NUTCH-880
 URL: https://issues.apache.org/jira/browse/NUTCH-880
 Project: Nutch
  Issue Type: New Feature
Affects Versions: 2.0
Reporter: Andrzej Bialecki 
Assignee: Andrzej Bialecki 
 Attachments: API.patch


 This issue is for discussing a REST-style API for accessing Nutch.
 Here's an initial idea:
 * I propose to use org.restlet for handling requests and returning 
 JSON/XML/whatever responses.
 * hook up all regular tools so that they can be driven via this API. This 
 would have to be an async API, since all Nutch operations take a long time to 
 execute. It follows that we also need to be able to list running 
 operations, retrieve their current status, and possibly 
 abort/cancel/stop/suspend/resume/...? This also means that we would have to 
 potentially create and manage many threads in a servlet - AFAIK this is frowned 
 upon by J2EE purists...
 * package this in a webapp (that includes all deps, essentially nutch.job 
 content), with the restlet servlet as an entry point.
 Open issues:
 * how to implement the reading of crawl results via this API
 * should we manage only crawls that use a single configuration per webapp, or 
 should we have a notion of crawl contexts (sets of crawl configs) with CRUD 
 ops on them? this would be nice, because it would allow managing of several 
 different crawls, with different configs, in a single webapp - but it 
 complicates the implementation a lot.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (NUTCH-907) DataStore API doesn't support multiple storage areas for multiple disjoint crawls

2010-10-01 Thread Andrzej Bialecki (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-907?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12916870#action_12916870
 ] 

Andrzej Bialecki  commented on NUTCH-907:
-

Hi Sertan,

Thanks for the patch, this looks very good! A few comments:

* I'm not good at naming things either... schemaId is a little bit cryptic 
though. If we didn't already use crawlId I would vote for that (and then rename 
crawlId to batchId or fetchId). As it is now... I don't know, maybe datasetId...

* since we now create multiple datasets, we somehow need to manage them - i.e. 
list and delete them at least (create is implicit). There is no such functionality 
in this patch, but this can also be addressed as a separate issue.

* IndexerMapReduce.createIndexJob: I think it would be useful to pass the 
datasetId as a Job property - this way indexing filter plugins can use this 
property to populate NutchDocument fields if needed. FWIW, this may be a good 
idea to do in other jobs as well...
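
A tiny sketch of that last point (the property key "storage.crawl.id" is only an example):

{code}
import org.apache.hadoop.conf.Configuration;

// Example only: the job sets the dataset id once, and an indexing filter can
// later read it back and put it into a NutchDocument field.
public class DatasetIdPropagation {
  public static final String DATASET_ID_KEY = "storage.crawl.id";

  public static void storeOnJob(Configuration conf, String datasetId) {
    if (datasetId != null) {
      conf.set(DATASET_ID_KEY, datasetId);
    }
  }

  public static String readInFilter(Configuration conf) {
    return conf.get(DATASET_ID_KEY, "default");
  }
}
{code}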

 DataStore API doesn't support multiple storage areas for multiple disjoint 
 crawls
 -

 Key: NUTCH-907
 URL: https://issues.apache.org/jira/browse/NUTCH-907
 Project: Nutch
  Issue Type: Bug
Reporter: Andrzej Bialecki 
 Fix For: 2.0

 Attachments: NUTCH-907.patch


 In Nutch 1.x it was possible to easily select a set of crawl data (crawldb, 
 page data, linkdb, etc) by specifying a path where the data was stored. This 
 enabled users to run several disjoint crawls with different configs, but 
 still using the same storage medium, just under different paths.
 This is not possible now because there is a 1:1 mapping between a specific 
 DataStore instance and a set of crawl data.
 In order to support this functionality the Gora API should be extended so 
 that it can create stores (and data tables in the underlying storage) that 
 use arbitrary prefixes to identify the particular crawl dataset. Then the 
 Nutch API should be extended to allow passing this crawlId value to select 
 one of possibly many existing crawl datasets.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (NUTCH-882) Design a Host table in GORA

2010-10-01 Thread Andrzej Bialecki (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-882?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12916874#action_12916874
 ] 

Andrzej Bialecki  commented on NUTCH-882:
-

Doğacan, I missed your previous comment... the issue with partial bloom filters 
is usually solved by having each task store its own filter - this worked well for 
MapFile-s because they consisted of multiple parts, so a Reader would open 
a part and its corresponding bloom filter.

Here it's more complicated, I agree... though this reminds me of the situation 
that is handled by DynamicBloomFilter: it's basically a set of Bloom filters 
with a facade that hides this fact from the user. Here we could construct 
something similar, i.e. don't merge partial filters after closing the output, 
but instead when opening a Reader read all partial filters and pretend they are 
one.
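
Something along these lines (a sketch; loading and serializing the per-task filters is left out):

{code}
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.util.bloom.BloomFilter;
import org.apache.hadoop.util.bloom.Key;

// Facade over per-task filters: each task wrote its own filter, the reader
// loads all of them, and a key is considered present if any part matches.
public class PartialBloomFacade {
  private final List<BloomFilter> parts = new ArrayList<BloomFilter>();

  public void addPart(BloomFilter part) {   // one per map/reduce task output
    parts.add(part);
  }

  /** Same contract as a single filter: false means the key is definitely absent. */
  public boolean membershipTest(Key key) {
    for (BloomFilter part : parts) {
      if (part.membershipTest(key)) {
        return true;
      }
    }
    return false;
  }
}
{code}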

 Design a Host table in GORA
 ---

 Key: NUTCH-882
 URL: https://issues.apache.org/jira/browse/NUTCH-882
 Project: Nutch
  Issue Type: New Feature
Affects Versions: 2.0
Reporter: Julien Nioche
Assignee: Julien Nioche
 Fix For: 2.0

 Attachments: hostdb.patch, NUTCH-882-v1.patch


 Having a separate GORA table for storing information about hosts (and 
 domains?) would be very useful for : 
 * customising the behaviour of the fetching on a host basis e.g. number of 
 threads, min time between threads etc...
 * storing stats
 * keeping metadata and possibly propagate them to the webpages 
 * keeping a copy of the robots.txt and possibly use that later to filter the 
 webtable
 * store sitemaps files and update the webtable accordingly
 I'll try to come up with a GORA schema for such a host table but any comments 
 are of course already welcome 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (NUTCH-864) Fetcher generates entries with status 0

2010-10-01 Thread Andrzej Bialecki (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-864?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12916912#action_12916912
 ] 

Andrzej Bialecki  commented on NUTCH-864:
-

I think the difficulty comes from the simplification in 2.x as compared to 1.x, 
in that we keep a single status per page. In 1.x a side-effect of having two 
locations with two statuses (one db status in crawldb and one fetch status 
in segments) was that we had more information in updatedb to act upon.

Now we should probably keep up to two statuses - one that reflects a temporary 
fetch status, as determined by the fetcher, and a final (reconciled) status as 
determined by updatedb, based on knowledge of not only the plain fetch status 
and the old status but also possible redirects. If I'm not mistaken, currently 
the status is immediately overwritten by the fetcher, even before we get to 
updatedb, hence the problem...
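
To make the two-status idea concrete, a rough sketch (constants and names are made up; this is not the webtable schema):

{code}
// The fetcher records only a provisional fetch status; updatedb derives the
// final db status from the previous status, the fetch status and any redirects.
public class StatusReconciliationSketch {
  static final int DB_UNFETCHED = 1, DB_FETCHED = 2, DB_GONE = 3, DB_REDIRECT = 4;
  static final int FETCH_SUCCESS = 1, FETCH_GONE = 2, FETCH_REDIRECT = 3;

  /** Called from the updatedb reducer, never from the fetcher itself. */
  static int reconcile(int previousDbStatus, Integer provisionalFetchStatus) {
    if (provisionalFetchStatus == null) {
      return previousDbStatus;            // page was not fetched in this round
    }
    switch (provisionalFetchStatus.intValue()) {
      case FETCH_SUCCESS:  return DB_FETCHED;
      case FETCH_REDIRECT: return DB_REDIRECT;
      case FETCH_GONE:     return DB_GONE;
      default:             return previousDbStatus;
    }
  }
}
{code}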

 Fetcher generates entries with status 0
 ---

 Key: NUTCH-864
 URL: https://issues.apache.org/jira/browse/NUTCH-864
 Project: Nutch
  Issue Type: Bug
  Components: fetcher
 Environment: Gora with SQLBackend
 URL: https://svn.apache.org/repos/asf/nutch/branches/nutchbase
 Last Changed Rev: 980748
 Last Changed Date: 2010-07-30 14:19:52 +0200 (Fri, 30 Jul 2010)
Reporter: Julien Nioche
Assignee: Doğacan Güney
 Fix For: 2.0


 After a round of fetching which got the following protocol status :
 10/07/30 15:11:39 INFO mapred.JobClient: ACCESS_DENIED=2
 10/07/30 15:11:39 INFO mapred.JobClient: SUCCESS=1177
 10/07/30 15:11:39 INFO mapred.JobClient: GONE=3
 10/07/30 15:11:39 INFO mapred.JobClient: TEMP_MOVED=138
 10/07/30 15:11:39 INFO mapred.JobClient: EXCEPTION=93
 10/07/30 15:11:39 INFO mapred.JobClient: MOVED=521
 10/07/30 15:11:39 INFO mapred.JobClient: NOTFOUND=62
 I ran : ./nutch org.apache.nutch.crawl.WebTableReader -stats
 10/07/30 15:12:37 INFO crawl.WebTableReader: Statistics for WebTable: 
 10/07/30 15:12:37 INFO crawl.WebTableReader: TOTAL urls:  2690
 10/07/30 15:12:37 INFO crawl.WebTableReader: retry 0: 2690
 10/07/30 15:12:37 INFO crawl.WebTableReader: min score:   0.0
 10/07/30 15:12:37 INFO crawl.WebTableReader: avg score:   0.7587361
 10/07/30 15:12:37 INFO crawl.WebTableReader: max score:   1.0
 10/07/30 15:12:37 INFO crawl.WebTableReader: status 0 (null): 649
 10/07/30 15:12:37 INFO crawl.WebTableReader: status 2 (status_fetched):   
 1177 (SUCCESS=1177)
 10/07/30 15:12:37 INFO crawl.WebTableReader: status 3 (status_gone):  112 
 10/07/30 15:12:37 INFO crawl.WebTableReader: status 34 (status_retry):
 93 (EXCEPTION=93)
 10/07/30 15:12:37 INFO crawl.WebTableReader: status 4 (status_redir_temp):
 138  (TEMP_MOVED=138)
 10/07/30 15:12:37 INFO crawl.WebTableReader: status 5 (status_redir_perm):
 521 (MOVED=521)
 10/07/30 15:12:37 INFO crawl.WebTableReader: WebTable statistics: done
 There should not be any entries with status 0 (null)
 I will investigate a bit more...

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (NUTCH-913) Nutch should use new namespace for Gora

2010-10-13 Thread Andrzej Bialecki (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-913?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12920610#action_12920610
 ] 

Andrzej Bialecki  commented on NUTCH-913:
-

There are formatting issues in DomainStatistics.java - the file uses literal 
tabs, which we frown upon, but the patch introduces double-space indent in the 
changed lines. As ugly as it sounds I think this should be changed into tabs, 
and then reformatted in another commit.

Other than that, +1, go for it.

 Nutch should use new namespace for Gora
 ---

 Key: NUTCH-913
 URL: https://issues.apache.org/jira/browse/NUTCH-913
 Project: Nutch
  Issue Type: Bug
  Components: storage
Reporter: Doğacan Güney
Assignee: Doğacan Güney
 Fix For: 2.0

 Attachments: NUTCH-913_v1.patch


 Gora is in Apache Incubator now (Yey!). We recently changed Gora's namespace 
 from org.gora to org.apache.gora. This means Nutch should use the new 
 namespace, otherwise it won't compile with newer builds of Gora.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (NUTCH-921) Reduce dependency of Nutch on config files

2010-10-19 Thread Andrzej Bialecki (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-921?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrzej Bialecki  updated NUTCH-921:


Attachment: NUTCH-921.patch

Patch that implements reading config parameters from Configuration, and falls 
back to config files if Configuration properties are unspecified.
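
The general pattern looks like this (a simplified sketch; the property and file names are only examples, not necessarily the ones used in the patch):

{code}
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import org.apache.hadoop.conf.Configuration;

// Inline configuration wins; the classpath file is only the fallback.
public class ConfigWithFileFallback {
  public static InputStream openRules(Configuration conf) {
    String inlineRules = conf.get("urlfilter.regex.rules");
    if (inlineRules != null) {
      return new ByteArrayInputStream(inlineRules.getBytes());
    }
    String fileName = conf.get("urlfilter.regex.file", "regex-urlfilter.txt");
    return ConfigWithFileFallback.class.getClassLoader().getResourceAsStream(fileName);
  }
}
{code}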

 Reduce dependency of Nutch on config files
 --

 Key: NUTCH-921
 URL: https://issues.apache.org/jira/browse/NUTCH-921
 Project: Nutch
  Issue Type: Improvement
Affects Versions: 2.0
Reporter: Andrzej Bialecki 
Assignee: Andrzej Bialecki 
 Fix For: 2.0

 Attachments: NUTCH-921.patch


 Currently many components in Nutch rely on reading their configuration from 
 files. These files need to be on the classpath (or packed into a job jar). 
 This is inconvenient if you want to manage configuration via API, e.g. when 
 embedding Nutch, or running many jobs with slightly different configurations.
 This issue tracks the improvement to make various components read their 
 config directly from Configuration properties.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


