Re: Nutch 2.0

2010-06-28 Thread Andrzej Bialecki
On 2010-06-28 07:49, Sami Siren wrote:
 One aspect that has not been discussed yet is the legal aspect.
 According to
 http://incubator.apache.org/ip-clearance/index.html there is a formal
 process for integrating external development efforts that have
 happened outside of Apache. Should we be following the ip clearance
 process in this case too?

The concept of a substantial contribution that should be subject to a
software grant is somewhat tenuous, though. Keep in mind that you do
something equivalent in JIRA already - when you check the "Grant license
to ASF" box you perform a micro-grant. So the question is whether we
should go through a full grant or through the JIRA micro-grant.

In my opinion it's ok to do the latter, since much of the code is simply
a modified version of Nutch classes - not counting GORA, of course, but
that part will be added as a third-party lib. So IMHO it's enough to zip
all source (without libs), attach it to a JIRA issue and mark the
checkbox. Then we follow the process outlined by Chris, which imports
the same codebase into our svn. What do you think?

If folks agree that this is sufficient, then Dogacan & Enis - can you
please create a separate JIRA issue, prepare a patch like this, mark the
checkbox, and list all dependencies and their licenses for those that
are not already in Nutch svn?

-- 
Best regards,
Andrzej Bialecki 
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: Where is nutch 2.0

2010-06-29 Thread Andrzej Bialecki
On 2010-06-29 11:17, Raghavendra Neelekani wrote:
 Hi
 
 Can you please tell me where I can download Nutch 2.0?

Nutch 2.0 is in the planning and early development phase, so it can't be
downloaded yet. We hope to produce a working Nutch 2.0 some time in Q4 2010.

-- 
Best regards,
Andrzej Bialecki 
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



[Nutchbase] WebPage class is a generated code?

2010-07-02 Thread Andrzej Bialecki

Hi,

(This question is mostly to Dogacan & Enis, but I encourage anyone 
familiar with the code to join the threads with [Nutchbase] - the sooner 
the better ;) ).


I'm looking at src/gora/webpage.avsc and WebPage.java & friends... 
presumably the java code was autogenerated from avsc using Gora? If so, 
we should put this autogeneration step in our build.xml. Or am I missing 
something?
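
For illustration, such a target could look roughly like the sketch below. 
This is only a sketch - the compiler class name, the classpath reference 
and the output path are assumptions that would need to be checked against 
the Gora code we actually ship:

<!-- hypothetical target: regenerate WebPage.java & friends from the Avro schema -->
<target name="generate-gora-sources">
  <java classname="org.gora.compiler.GoraCompiler" classpathref="classpath"
        fork="true" failonerror="true">
    <arg value="src/gora/webpage.avsc"/> <!-- input Avro schema -->
    <arg value="src/java"/>              <!-- destination dir for generated sources -->
  </java>
</target>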


--
Best regards,
Andrzej Bialecki 
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: Minimizing the number of stored fields for Solr

2010-07-03 Thread Andrzej Bialecki

On 2010-07-03 10:00, Doğacan Güney wrote:

Hey everyone,

This is not really a proposition but rather something I have been wondering
for a while so I wanted to see what everyone is
thinking.

Currently in our solr backend, we have stored=true indexed=false fields
and stored=true indexed=true fields. The former
class of fields are mostly used for storing digest, caching information etc.
I suggest that we get rid of all indexed=false fields and
read all such data from storage backend.

For the latter class of fields (i.e., stored=true indexed=true), I suggest
that we set them to stored=false for everything but id field. As an
example currently title is stored/indexed in solr while text is only indexed
(thus, will need to be fetched from storage backend). But for hbase
backend, title and text are already stored close together (in the same
column family) so performance hit of reading just text or reading both
will likely be the same. And removing storage from solr may lead to better
caching of indexed fields and thus to better performance.

What does everyone think?



The issue is not as simple as it looks. If you want to have good 
performance for searching & snippet generation then you still need to 
store some data in stored fields - at least url, title, and plain text 
(not to mention the option to use term vectors in order to speed up the 
snippet generation). Solr functionality can be also impaired by a lack 
of data available directly from Lucene storage (field cache, faceting, 
term vector highlighting).


Some fields of course are not useful for display, but are used for 
searching only (e.g. anchors). These should be indexed but not stored in 
Solr. And it's ok to get them from non-solr storage if requested, 
because it's a rare event. The same goes for the full raw content, if 
you want to offer a cached view - this should not be stored in Solr 
but instead it should come from a separate layer (note that sometimes 
cached view might not be in the original format - pdf, office, etc - and 
instead an html representation may be more suitable, so in general the 
cached view shouldn't automatically equal the original raw content).
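
For illustration, the distinction in schema.xml terms could look like the 
sketch below (field names and types here are just examples, not a proposal 
for the actual schema):

<!-- illustrative schema.xml fragment only -->
<field name="id"      type="string" indexed="true"  stored="true"/>
<field name="title"   type="text"   indexed="true"  stored="true"/>  <!-- needed for display & snippets -->
<field name="content" type="text"   indexed="true"  stored="true"/>  <!-- keep stored for now, see below -->
<field name="anchor"  type="text"   indexed="true"  stored="false" multiValued="true"/>  <!-- search-only -->
<field name="digest"  type="string" indexed="false" stored="true"/>  <!-- candidate to serve from the storage backend -->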


But for other fields I would argue that for now they should remain 
stored in Solr, *even the full text*, until we figure out how they 
affect the ability and performance of common search operations. E.g. if 
we remove the stored title field then we need to reach into the storage 
layer in order to display each page of results... not to mention issues 
like highlighting, faceting, function queries and a host of other 
functionalities that Solr can offer just because a field is stored in 
its index.


So I'm -0 to this proposal - of course we should review our schema, and 
of course we should have a mechanism to get data from the storage layer, 
but what you propose is IMHO a premature optimization at this point.


--
Best regards,
Andrzej Bialecki 
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



YCSB benchmark for KV stores

2010-07-03 Thread Andrzej Bialecki

Hi,

Found this link:

http://wiki.github.com/brianfrankcooper/YCSB/papers-and-presentations

Would be cool to run the benchmark for the same stores but via Gora.

--
Best regards,
Andrzej Bialecki 
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: Parse-tika ignores too much data...

2010-07-08 Thread Andrzej Bialecki

On 2010-07-07 22:32, Ken Krugler wrote:

Hi Julien,


See https://issues.apache.org/jira/browse/TIKA-457 for a description
of one of the cases found by Andrzej. There seems to be something very
wrong with the way body is handled; we also saw cases where it appeared
twice in the output.


Don't know about the case of it appearing twice.

But for the above issue, I added a comment. The test HTML is badly
broken, in that you can either have a body OR a frameset, but not both.


The HTML was broken on purpose - one of the goals of the original test 
was to extract as much content and as many links as possible in the 
presence of grave errors - as you know, even major sites often produce 
badly broken HTML, but the parser should sanitize it and produce a valid 
DOM. In this case, it produced two nested body elements, which is not 
valid. I should also mention that NekoHTML handled this test much better, 
by removing the body and retaining only the frameset.


--
Best regards,
Andrzej Bialecki 
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: Merging in nutchbase

2010-07-10 Thread Andrzej Bialecki

On 2010-07-10 15:00, Doğacan Güney wrote:

Hey everyone,

I would like to start merging in nutchbase to trunk, so I am hoping to get
everyone's comments and suggestions on
how to do that.



Do we have any way to run the merged code without running HBase? I think 
that the SQL backend to Gora needs to be tested first with the nutchbase 
branch - otherwise the development and testing will become very 
difficult... So in my opinion we need to make sure we can use a small 
SQL backend (Derby or HSQL) before we start merging.


As for the mechanics of the patching - yes, I think it needs to be done 
this way.


--
Best regards,
Andrzej Bialecki 
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: Merging in nutchbase

2010-07-10 Thread Andrzej Bialecki

On 2010-07-10 17:01, Doğacan Güney wrote:

Hey everyone,

On Sat, Jul 10, 2010 at 17:43, Mattmann, Chris A (388J)
chris.a.mattm...@jpl.nasa.gov  wrote:


  Hey Guys,

+1 to Andrzej’s suggestion. I mostly run small scale stuff with Nutch, so
unless I can run HBase in small scale (or better yet, an embedded SQL db), I
won’t be as much use! :)



I just want to make clear that this is, indeed, a goal I share. Gora already
has an SQL backend that can use embedded hsqldb. However, there are some
weird bugs (I really hate SQL :), but once I am done fixing all bugs (which
I will be doing today and tomorrow), nutch will run on gora - (embedded
hsqldb) with zero configuration.


Excellent, that would be a real breakthrough.


--
Best regards,
Andrzej Bialecki 
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: Component fetching during parsing. (vertical crawling)

2010-07-20 Thread Andrzej Bialecki
On 2010-07-20 14:30, Ferdy wrote:
 Hello,
 
 We are currently using a heavily modified version of nutch. The main
 reason for this is the fact that we do not only fetch the urls that the
 QueueFeeder submits, but also additional resources from urls that are
 constructed during parsing. So for example let's say the QueueFeeder
 submits a html page to the fetcher, and after the fetch the page gets
 parsed. Nothing special so far. However the parser decides it also needs
 some images on the page. Perhaps these images link to other html pages,
 and we might want to fetch these too. All this is needed to parse
 information about this particular url we started with. These extra fetch
 urls we like to call Components, because they are additional resources
 required to do the parsing of our initial html page that was selected
 for fetching.
 
 At first we tried to solve this vertical crawling problem by using
 multiple crawl cycles. Each crawl simply selects outlinks that are
 needed for the parsing of the initial html page. A single inspection can
 possibly overlap 2, 3 or 4 cycles (depending on the inspection's graph
 depth). There are several problems with this approach, for one that the
 crawldb is cluttered with all these component urls and secondly that
 inspection completion times can be very long.
 
 As an alternative we decided to let the parser fetch needed components
 on-the-fly, so that additional urls are instantly added to the fetcher
 lists. Every fetched url can be either a non-component (the QueueFeeder
 fed it; start parsing this resource) or a component (the fetcher
 hands the resource over to the parser that requested it). In order to
 keep parsers alive we always try to fetch components first, while
 respecting fetch politeness. A downside of this solution is that your fetch task's
 total running time will be more difficult to anticipate. For example,
 if you inject and generate 100 urls and they will be fetched in a single
 task, you might end up fetching a total of 1100 urls (on the assumption that
 each inspection needs 10 components). We found this behaviour to be
 acceptable.
 
 Because of our custom version of nutch we cannot upgrade easily to newer
 versions (we're still using modified fetcher classes from nutch 0.9).
 Often we end up fixing bugs that have already been fixed by the
 community. Also, other users might benefit from our changes too.
 
 Therefore we propose to redesign our vertical crawling system from
 scratch for the newer nutch versions, should there be any interest from
 the community. Perhaps we are not the only one to implement such a
 system with nutch. So, what are your thoughts about this?

If I understand your use case properly, this is really a custom Fetcher
that you are talking about - a strategy to fetch complete pages
(together with the resources needed to display the page)
should be possible to implement in a custom fetcher without changing
other Nutch areas.


-- 
Best regards,
Andrzej Bialecki 
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: Nutchbase merge strategy

2010-07-21 Thread Andrzej Bialecki

On 2010-07-21 21:12, Mattmann, Chris A (388J) wrote:

Hey Andrzej,


+1 to all of the above - see below.



So if 1-4 make sense, let's do 1, 2 and 3 today or tomorrow -- 4 can happen
over the next few weeks. WDYT?


This is a serious move - let's wait a bit, say until Monday, to give
chance to others to comment.


Agreed. Let's wait until Monday. If there aren't any objections, let's let
er' rip!

BTW, #4 is independent of #1-3. WDYT about wrapping up the 1.x series of
Nutch and rolling a 1.2 in the next few days (while I have some free
cycles)? :) #4 is also in its own branch and therefore independent as well
so it won't be as brave a move.

Let me know what you (all) think.


If 1.2 is going to be the last release in 1.x series then I think we 
should review some pending issues, especially those reported after 1.0 
release:


https://issues.apache.org/jira/secure/IssueNavigator.jspa?reset=true&pid=10680&updated%3Aprevious=-1w&created%3Aafter=1%2FApr%2F09&status=1&status=3&status=4&sorter/field=updated&sorter/order=DESC

Actually, just two issues are still unresolved... hmm, not bad.

--
Best regards,
Andrzej Bialecki 
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



[Nutchbase] Multi-value ParseResult missing

2010-07-21 Thread Andrzej Bialecki

Hi,

I noticed that nutchbase doesn't use the multi-valued ParseResult, 
instead all parse plugins return a simple Parse. As a consequence, it's 
not possible to return multiple values from parsing a single WebPage, 
something that parsers for compound documents absolutely require 
(archives, rss, mbox, etc). Dogacan - was there a particular reason for 
this change?


However, a broader issue here is how to treat compound documents, and 
links to/from them:
 a) record all URLs of child documents (e.g. with the !/ notation, or # 
notation), and create as many WebPage-s as there were archive members. 
This needs some hacks to prevent such urls from being scheduled for 
fetching.
 b) extend WebPage to allow for multiple content sections and their 
names (and metadata, and ... yuck)
 c) like a) except put a special synthetic mark on the page to 
prevent selection of this page for generation and fetching. This mark 
would also help us to update / remove obsolete sub-documents when their 
container changes.

I'm leaning towards c).

Now, when it comes to the ParseResult ... it's not an ideal solution 
either, because it means we have to keep all sub-document results in 
memory. We could avoid it by implementing something that Aperture uses, 
which is a "sub-crawler" - a concept of a parser plugin for compound 
formats. The main plugin would return a special result code, which 
basically says "this is a compound format of type X", and then the 
caller (ParseUtil?) would use SubCrawlerFactory.create(typeX, 
containerDataStream) to create a parser for the container. This parser 
in turn would simply extract sections of the compound document (as 
streams) and it would pass each stream to the regular parsing chain. The 
caller then needs to iterate over results returned from the SubCrawler. 
What do you think?


--
Best regards,
Andrzej Bialecki 
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Benchmark of Nutch trunk

2010-07-30 Thread Andrzej Bialecki

Hi,

We have a simple crawling benchmark now in trunk. Here's how to use it:

* in one console execute 'ant proxy'. This will start a proxy server on 
port 8181 that produces fake pages.


* in another console execute 'ant benchmark'. This will run 5 rounds of 
fetching (~16,000 pages) using that proxy server.


There are already some interesting issues I noticed. First, on 
reasonably good hardware in local mode I was able to fetch and process 
(NOTE: this includes ALL steps, i.e. generate, fetch, parse, crawldb 
update and invertlinks) 16k pages in 400 sec. This means a total 
crawling throughput of 40 pages/sec. This is in local mode, so in 
distributed mode I guess we would be getting this number times the 
number of tasks.


Secondly, it seems that Fetcher has some synchronization issues in its 
queue management - even if other queues are non-empty, but one of the 
queues blocks, the Fetcher will spin-wait all threads until an item 
becomes available on that queue, and then it starts to happily consume 
items from all non-blocking queues (including this one). The process 
then repeats - one queue blocks, and all threads stop getting items from 
other queues... At the moment I can't figure out where this lock-up is 
happening, but the symptoms are obvious when you look at the logs in 
real-time.


More stuff to come on this subject - at least we have a tool to 
experiment with :)


--
Best regards,
Andrzej Bialecki 
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: Seeking Insight into Nutch Configurations

2010-08-02 Thread Andrzej Bialecki

On 2010-08-02 10:17, Scott Gonyea wrote:

The big problem that I am facing, thus far, occurs on the 4th fetch.
All but 1 or 2 maps complete. All of the running reduces stall (0.00
MB/s), presumably because they are waiting on that map to finish? I
really don't know and it's frustrating.


Yes, all map tasks need to finish before reduce tasks are able to 
proceed. The reason is that each reduce task receives a portion of the 
keyspace (and values) according to the Partitioner, and in order to 
prepare a nice key, list(value) in your reducer it needs to, well, get 
all the values under this key first, whichever map task produced the 
tuples, and then sort them.


The failing tasks probably fail due to some other factor, and very 
likely (based on my experience) the failure is related to some 
particular URLs. E.g. regex URL filtering can choke on some pathological 
URLs, like URLs 20kB long, or containing '\0' etc, etc. In my 
experience, it's best to keep regex filtering to a minimum if you can, 
and use other urlfilters (prefix, domain, suffix, custom) to limit your 
crawling frontier. There are simply too many ways where a regex engine 
can lock up.


Please check the logs of the failing tasks. If you see that a task is 
stalled you could also log in to the node, and generate a thread dump a 
few times in a row (kill -SIGQUIT pid) - if each thread dump shows the 
regex processing then it's likely this is your problem.



My scenario:
# Sites: 10,000-30,000 per crawl
Depth: ~5
Content: Text is all that I care for. (HTML/RSS/XML)
Nodes: Amazon EC2 (ugh)
Storage: I've performed crawls with HDFS and with Amazon S3. I thought S3 would be more performant, yet it doesn't appear to affect matters.
Cost vs Speed: I don't mind throwing EC2 instances at this to get it done quickly... But I can't imagine I need much more than 10-20 mid-size instances for this.


That's correct - with this number of unique sites the max. throughput of 
your crawl will be ultimately limited by the politeness limits (# of 
requests/site/sec).




Can anyone share their own experiences in the performance they've
seen?


There is a very simple benchmark in trunk/ that you could use to measure 
the raw performance (data processing throughput) of your EC2 cluster. 
The real-life performance, though, will depend on many other factors, 
such as the number of unique sites, their individual speed, and (rarely) 
the total bandwidth at your end.



--
Best regards,
Andrzej Bialecki 
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: Seeking Insight into Nutch Configurations

2010-08-02 Thread Andrzej Bialecki

On 2010-08-02 22:59, Scott Gonyea wrote:

By the way, can anyone tell me if there is a way to explicitly limit how
many pages should be fetched, per fetcher-task?


I believe that in general case it would be a very complex problem to 
solve so that you get exact results. The reason is that Nutch doesn't 
use any global lock manager, so the only way to ensure a proper per-host 
locking is to assign all URL-s from any given host to the same map task. 
This may (and often will) create an imbalance in the number of allocated 
URL-s per task.


One method to mitigate this imbalance is to set generate.max.count (in 
trunk, generate.max.per.host in 1.1) - this will limit the number of 
URL-s from any given host to X, thus helping to mix these N per-host 
chunks more evenly across M maps.
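
For example, in nutch-site.xml (the value 100 below is just an 
illustration - tune it to your crawl):

<property>
  <name>generate.max.count</name>
  <value>100</value>
  <description>Upper limit on the number of URLs from a single host
  included in one fetchlist.</description>
</property>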



I think part of the problem is that, seemingly, Nutch seems to be
generating some really unbalanced fetcher tasks.

The task (task_201008021617_0026_m_00) had 6859 pages to fetch.
  Each higher-numbered task had fewer pages to fetch.  Task 000180 only
had 44 pages to fetch.


There's no specific tool to examine the composition of fetchlist 
parts... try running this in the segments/2010*/crawl_generate/:


for i in part-00*
do
  echo "--- part $i ---"
  strings $i | grep "http://"
done

to print URL-s per map task. Most likely you will see that there was no 
other way to allocate the URLs per task to satisfy the constraint that I 
explained above. If it's not the case, then it's a bug. :)




This *huge* imbalance, I think, creates tasks that are seemingly
unpredictable.  All of my other resources just sit around, wasting
resources, until one task grabs some crazy number of sites.


Again, generate.max.count is your friend - even though you won't be able 
to get all pages from a big site in one go, at least your crawls will 
finish quickly and you will quickly progress breadth-wise, if not 
depth-wise.


--
Best regards,
Andrzej Bialecki 
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



[Nutchbase] jmxtools issue...

2010-08-04 Thread Andrzej Bialecki

Hi,

I can't compile nutchbase at the moment - ivy has trouble finding 
jmxri.jar and jmxtools.jar ... I found jmxri.jar somewhere and put it in 
my .ivy2/local, but I can't find jmxtools.jar ... Anyway, why do we need 
these two jars at all???


--
Best regards,
Andrzej Bialecki 
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Hsqldb 2.0 conflicts with Hsqldb 1.8 in Hadoop

2010-08-10 Thread Andrzej Bialecki

Hi,

I was trying to run Benchmark in trunk using MySQL, on a standalone 
Hadoop cluster. My conf/gora.properties has this:


gora.sqlstore.jdbc.driver=com.mysql.jdbc.Driver
gora.sqlstore.jdbc.url=jdbc:mysql://localhost:3306/nutch?user=nutch&password=nutch

Jobs were failing though, with the following:

Exception in thread main java.lang.NoSuchMethodError: 
org.hsqldb.DatabaseURL.parseURL(Ljava/lang/String;ZZ)Lorg/hsqldb/persist/HsqlProperties;

at org.hsqldb.jdbc.JDBCDriver.getConnection(Unknown Source)
at org.hsqldb.jdbc.JDBCDriver.connect(Unknown Source)
at java.sql.DriverManager.getConnection(DriverManager.java:582)
at java.sql.DriverManager.getConnection(DriverManager.java:207)
at org.gora.sql.store.SqlStore.getConnection(SqlStore.java:712)
at org.gora.sql.store.SqlStore.initialize(SqlStore.java:145)
at 
org.gora.store.DataStoreFactory.initializeDataStore(DataStoreFactory.java:64)
at 
org.gora.store.DataStoreFactory.createDataStore(DataStoreFactory.java:86)
at 
org.gora.store.DataStoreFactory.createDataStore(DataStoreFactory.java:98)
at 
org.gora.store.DataStoreFactory.createDataStore(DataStoreFactory.java:70)
at 
org.apache.nutch.storage.StorageUtils.createDataStore(StorageUtils.java:25)
at 
org.apache.nutch.storage.StorageUtils.initMapperJob(StorageUtils.java:68)
at 
org.apache.nutch.storage.StorageUtils.initMapperJob(StorageUtils.java:50)

at org.apache.nutch.crawl.InjectorJob.inject(InjectorJob.java:237)
at org.apache.nutch.tools.Benchmark.benchmark(Benchmark.java:190)
at org.apache.nutch.tools.Benchmark.run(Benchmark.java:139)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
at org.apache.nutch.tools.Benchmark.main(Benchmark.java:32)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)

at java.lang.reflect.Method.invoke(Method.java:597)
at org.apache.hadoop.util.RunJar.main(RunJar.java:156)


Isn't this puzzling... It turns out that java.sql.DriverManager will try 
_all_ drivers in turn to see which one can handle the jdbcUrl, and the 
usual magic of Class.forName(jdbcDriver) doesn't mean we are going to 
use jdbcDriver, it's just to make sure the driver class was loaded and 
registered itself on the list of available drivers.


Now, I know why this particular error occurred - Hadoop includes HSQLDB 
1.8, and we use HSQLDB 2.0. When DriverManager tries each driver in 
turn, unfortunately Hsqldb is first on the classpath (it ships in 
Hadoop's lib/), and MySQL is the last, so it bombs out even before trying 
the right driver...


For now I changed my build.xml to this:

Index: build.xml
===================================================================
--- build.xml   (revision 983564)
+++ build.xml   (working copy)
@@ -123,7 +123,7 @@
               excludes="nutch-default.xml,nutch-site.xml"/>
       <zipfileset dir="${conf.dir}" excludes="*.template,hadoop*.*"/>
       <zipfileset dir="${build.lib.dir}" prefix="lib"
-          includes="**/*.jar" excludes="hadoop-*.jar"/>
+          includes="**/*.jar" excludes="hadoop-*.jar,hsqldb*.jar"/>
       <zipfileset dir="${build.plugins}" prefix="plugins"/>
     </jar>
   </target>



--
Best regards,
Andrzej Bialecki 
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: Tika HTML parsing

2010-08-15 Thread Andrzej Bialecki

On 2010-08-15 06:54, Ken Krugler wrote:

For what it's worth, I just committed some patches to Tika that should
improve Tika's ability to extract HTML outlinks (in img and frame
elements, at least). Support for iframe should be coming soon :)

This is in 0.8-SNAPSHOT, and there's one troubling parse issue I'm
tracking down, but I think Tika is getting closer to being usable by
Nutch for typical web crawling.


Thanks Ken for pushing forward this work! A few questions:

* does this include image maps as well (area)?

* how does the code treat invalid html with both body and frameset?

* what's the status of extracting the meta robots and link rel information?

--
Best regards,
Andrzej Bialecki 
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: Alternative search box for Nutch site

2010-08-30 Thread Andrzej Bialecki

On 2010-08-30 12:21, Otis Gospodnetic wrote:

Hello peeps,

We've created a patch for Tika and got some good and constructive feedback (see
https://issues.apache.org/jira/browse/TIKA-488 ).

Should we follow the same functionality pattern for nutch.apache.org as seen in
TIKA-488?


Sure, why not - when preparing the patch let's follow the same 
rationales as those in TIKA-488, since they are applicable here too.



--
Best regards,
Andrzej Bialecki 
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: nutch 2.0 (trunk)

2010-09-07 Thread Andrzej Bialecki

On 2010-09-07 14:50, Faruk Berksöz wrote:

Dear all,

When I try to fetch a web page (e.g.
http://www.w3.org/Protocols/rfc2616/rfc2616-sec14.html ) with the mysql
storage definition,
I am seeing the following error in my hadoop logs (no error with
hbase):

java.io.IOException: java.sql.BatchUpdateException: Data truncation:
Data too long for column 'content' at row 1
 at org.gora.sql.store.SqlStore.flush(SqlStore.java:316)
 at org.gora.sql.store.SqlStore.close(SqlStore.java:163)
 at
org.gora.mapreduce.GoraOutputFormat$1.close(GoraOutputFormat.java:72)
 at
org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:567)
 at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:408)
 at
org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:216)

The type of the column 'content' is BLOB.
It may be important for the next developments of Gora.
Should I file this in the Nutch JIRA, or on github/gora, or not at all?

environments : ubuntu 10.04
JVM : 1.6.0_20
nutch 2.0 (trunk)
Mysql/HBase (0.20.6) / Hadoop(0.20.2) pseudo-distributed


Yes, please create a JIRA issue. Thanks!



--
Best regards,
Andrzej Bialecki 
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: [VOTE] Apache Nutch 1.2 Release Candidate #1

2010-09-10 Thread Andrzej Bialecki

On 2010-08-09 16:45, Julien Nioche wrote:

I reopened https://issues.apache.org/jira/browse/NUTCH-870. It would be
good to fix it before releasing 1.2



This is fixed. How about doing the release now?


--
Best regards,
Andrzej Bialecki 
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: [VOTE] Apache Nutch 1.2 Release Candidate #4

2010-09-24 Thread Andrzej Bialecki

On 2010-09-24 04:38, Mattmann, Chris A (388J) wrote:

Hi Nutch PMC:

/nudge

Anyone get a chance to review this yet? I have some free cycles tomorrow
and would really think it’s cool if I could finally push out the 1.2 RC.


I had little time this week, but I'm testing it now... I should be done 
tomorrow.



--
Best regards,
Andrzej Bialecki 
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: [VOTE] Apache Nutch 1.2 Release Candidate #4

2010-09-24 Thread Andrzej Bialecki

On 2010-09-24 20:40, Mattmann, Chris A (388J) wrote:

Thanks Andrzej, appreciate it. I know you’ve been really vigilant with
the other RCs I’ve thrown up about testing and I appreciate it. Other
Nutch PMC’ers: just need one more VOTE. Help, please? :)


+1, all unit tests pass, and a test crawl + indexing to Solr went just fine.


--
Best regards,
Andrzej Bialecki 
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: Build failed in Hudson: Nutch-trunk #1280

2010-10-19 Thread Andrzej Bialecki
On 2010-10-19 06:01, Apache Hudson Server wrote:

 [Nutch-trunk] $ /bin/bash -xe /tmp/hudson7277994413075810777.sh
 + 
 PATH=/home/hudson/tools/java/latest1.6/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/ucb:/usr/local/bin:/usr/bin:/usr/sfw/bin:/usr/sfw/sbin:/opt/sfw/bin:/opt/sfw/sbin:/opt/SUNWspro/bin:/usr/X/bin:/usr/ucb:/usr/sbin:/usr/ccs/bin
 + export ANT_HOME=/export/home/hudson/tools/ant/latest
 + ANT_HOME=/export/home/hudson/tools/ant/latest
 + export PATH ANT_HOME
 + cd trunk
 + /export/home/hudson/tools/ant/latest/bin/ant -Dversion=2010-10-19_04-00-41 
 -Dtest.junit.output.format=xml nightly
 /tmp/hudson7277994413075810777.sh: line 7: 
 /export/home/hudson/tools/ant/latest/bin/ant: No such file or directory

Do you know guys why the automated builds are failing? Looks like Ant is
not where the build script expects it to be...


-- 
Best regards,
Andrzej Bialecki 
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: ReviewBoard Instance

2010-10-26 Thread Andrzej Bialecki
On 2010-10-26 15:53, Mattmann, Chris A (388J) wrote:
 Hi Guys,
 
 Gav from infra@ set up a ReviewBoard instance for Apache [1]. I've never
 used it before but I thought I'd request an account on it for Nutch [2]
 regardless, so if folks want to use it, they can.

Hmm, I may be missing something... but what's the point of using the
tool in our JIRA-based workflow? It looks to me like it duplicates at
least part of JIRA's functionality, and the remaining part is what we do
also in JIRA by convention...


-- 
Best regards,
Andrzej Bialecki 
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: Java.io.IOException with multiple <copyField/> directives

2010-12-03 Thread Andrzej Bialecki
On 2010-12-03 09:52, Peter Litsegård wrote:
 Hi!
 
 I've run into a strange behaviour while using Nutch (solrindexer) together 
 with Solr 1.4.1. I'd like to copy the 'title' and 'content' field to another 
 field, say, 'foo'. In my first attempt I added the <copyField/> directives in 
 schema.xml and got the java exception, so I removed them from schema.xml. In 
 my second attempt I added the <copyField/> directives to the 
 'solrindex-mapping.xml' file and ran into the same exception again! Is this a 
 known issue or have I stumbled into unknown territory?
 
 Any workarounds?

I suspect that the field type declared in your schema.xml is not
multiValued. What was the exception?
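
For reference, a copy target that receives several sources usually has to 
be declared multiValued, along these lines (illustrative only - adjust the 
field type to your schema):

<field name="foo" type="text" indexed="true" stored="true" multiValued="true"/>
<copyField source="title"   dest="foo"/>
<copyField source="content" dest="foo"/>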


-- 
Best regards,
Andrzej Bialecki 
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: Does Nutch 2.0 in good enough shape to test?

2010-12-17 Thread Andrzej Bialecki

(switching to devs)

On 12/17/10 10:18 AM, Alexis wrote:

Hi,

I've spent some time working on this as well. I've just put together a
blog entry addressing the issues I ran into. See
http://techvineyard.blogspot.com/2010/12/build-nutch-20.html

In a nutchsell, I changed three pieces in Gora and Nutch code:
- flush the datastore regularly in the Hadoop RecordWriter (in GoraOutputFormat)


Careful here. DataStore flush may be very expensive, so it should be 
done only when we are finished with the output. If you see that data is 
lost without this flush then this should be reported as a Gora bug.



- wait for Hadoop job completion in the Fetcher job


I missed your previous email... I'll fix this shortly - thanks for 
spotting it.


--
Best regards,
Andrzej Bialecki 
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Gora/HBase dependencies and deploy artifacts

2011-02-24 Thread Andrzej Bialecki

Hi all,

Recently I've been deploying Nutch trunk to an already existing Hadoop 
cluster. And immediately I hit a snag.


Nutch was configured to use gora-hbase. The nutch.job jar doesn't 
include gora-hbase even if it was configured in nutch-site.xml. 
Furthermore, gora-hbase depends on HBase and its dependencies, which 
need to be found on classpath.


Typically for development and testing I solved this issue by deploying 
gora-core and gora-hbase + all hbase libs to hadoop/lib across the 
cluster. This is a bit dirty - Hadoop clusters should be seen as a 
generic computing fabric, so they should be application-agnostic; 
besides, this creates maintenance & ops issues.


We could put all these libs in lib/ inside nutch.job, so that they are 
unpacked and put on classpath during task setup. This would work fine 
for Mapper/Reducer. HOWEVER... I saw in some versions of Hadoop that 
InputFormat / OutputFormat classes were initialized prior to this 
unpacking - and in our case these depend on the libs in as-yet-unpacked 
job jar... e.g. GoraInputFormat. (I'm not 100% sure that's the case in 
Hadoop 0.20.2, so this is something that needs to be tested).
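
In build.xml terms that would mean adding something like the sketch below 
to the job jar target (gora.libs.dir is a made-up property here, just to 
illustrate the idea):

<!-- sketch only: bundle the Gora backend and its HBase dependencies under lib/ in nutch.job -->
<zipfileset dir="${gora.libs.dir}" prefix="lib"
            includes="gora-core-*.jar,gora-hbase-*.jar,hbase-*.jar,zookeeper-*.jar"/>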


Furthermore, even if we packed the jars in lib/ inside nutch.job, still 
many tools wouldn't work, because they depend on classes from those libs 
during the local execution (before the job is sent to task trackers), 
and the URLClassLoader can't load classes from jars within jars... A 
workaround for this would be to take all those jars and re-pack them 
together under / directory in nutch.job. This would satisfy the 
dependencies for local execution, and for Mapper/Reducer execution but 
I'm not sure if it solves the problem of Input/OutputFormat-s that I 
mentioned above.


In short, we need a clear working procedure how to deploy Gora backend 
implementations so that they work with Nutch and with a generic 
unmodified Hadoop cluster.


--
Best regards,
Andrzej Bialecki 
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: [jira] Closed: (NUTCH-951) Backport changes from 2.0 into 1.3

2011-03-10 Thread Andrzej Bialecki

On 3/10/11 10:57 PM, Julien Nioche (JIRA) wrote:


  [ 
https://issues.apache.org/jira/browse/NUTCH-951?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Nioche closed NUTCH-951.
---


NUTCH-825 committed in revision 1080368
All the known improvements from 2.0 have been backported into 1.3 now



The only remaining issue to address before rolling out a 1.3 release is 
NUTCH-914 Implement Apache Project Branding Requirements (and subtasks...)



--
Best regards,
Andrzej Bialecki 
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: Differences 1.x and trunk

2011-03-18 Thread Andrzej Bialecki

On 3/18/11 4:31 PM, Markus Jelsma wrote:

Hi all,

I'm giving it a try to patch https://issues.apache.org/jira/browse/NUTCH-963
to trunk after committing to 1.3. There are of course a lot of differences so
I need a little advice on how to proceed:

- instead of using CrawlDB and CrawlDatum we now need WebTableReader?


Actually you need to use StorageUtils to set up Mapper or Reducer 
contexts. See other tools, e.g. Fetcher or Generator.



- trunk uses slf instead of commons logging now?


Yes.


- a page is now represented by storage.WebPage?


Yes. When you prepare a Job you also need to specify what fields from 
WebPage you are interested in (and only these fields will be pulled in 
from the storage). This is all handled by StorageUtils methods.


--
Best regards,
Andrzej Bialecki 
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: [VOTE] Move 2.0 out of trunk

2011-09-20 Thread Andrzej Bialecki

On 18/09/2011 02:21, Julien Nioche wrote:

Hi,

Following the discussions [1] on the dev-list about the future of Nutch
2.0, I would like to call for a vote on moving Nutch 2.0 from the trunk
to a separate branch, promote 1.4 to trunk and consider 2.0 as
unmaintained. The arguments for / against can be found in the thread I
mentioned.

The vote is open for the next 72 hours.

[ ] +1 : Shelve 2.0 and move 1.4 to trunk
[] 0 : No opinion
[] -1 : Bad idea.  Please give justification.


+1 - at this time it's clear that 2.0 didn't pan out as we expected, and 
we should restart from the 1.x codebase as a usable platform, and continue 
the redesign from there.


--
Best regards,
Andrzej Bialecki 
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: [jira] [Commented] (NUTCH-797) parse-tika is not properly constructing URLs when the target begins with a ?

2011-10-12 Thread Andrzej Bialecki

On 12/10/2011 13:17, Markus Jelsma (Commented) (JIRA) wrote:


 [ 
https://issues.apache.org/jira/browse/NUTCH-797?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13125717#comment-13125717
 ]

Markus Jelsma commented on NUTCH-797:
-

This test was on a local instance. I tried both values for 
parser.fix.embeddedparams with:
$ bin/nutch parsechecker http://www.funkybabes.nl/;ROOOWAN/fotoboek


Is this how it should be implemented? I'm not sure. Embedded params are a bit 
puzzling :)


Hmm ... if that's the exact command-line expression that you entered 
then if you are using a *nix shell the semicolon would mean the end of 
command, so in fact what was executed would be:


$ bin/nutch parsechecker http://www.funkybabes.nl/
...lots of output ...
bash: ROOOWAN/fotoboek: command not found


--
Best regards,
Andrzej Bialecki 
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: Nutch Maven artifacts now published as polled/nightly SNAPSHOTS

2011-11-05 Thread Andrzej Bialecki

On 05/11/2011 06:44, Mattmann, Chris A (388J) wrote:

Hey Guys,

I modified the Jenkins jobs that Lewis set up to now:

* poll SCM hourly for changes to Nutch
* publish Maven snapshots (1.5-SNAPSHOT) and above of Nutch
to repository.apache.org


Very useful - thanks a lot!

--
Best regards,
Andrzej Bialecki 
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: Persistent problems with Ivy dependencies in Eclipse

2011-11-10 Thread Andrzej Bialecki

On 10/11/2011 04:39, Lewis John Mcgibbney wrote:

Gets even more strange, both SWFParser and AutomationURLFilter import
additional dependencies, however they are not included within their
plugin/ivy/ivy.xml files!

Am I missing something here?


Most likely these problems come from the initial porting of a pure ant 
build to an ant+ivy build. We should determine what deps are really 
needed by these plugins, and sanitize the ivy.xml files so that they 
make sense - if the existing files can't be untangled we can ditch them 
and come up with new, clean ones.
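
As a target state, a cleaned-up plugin ivy.xml would probably contain 
little beyond the plugin's real external dependencies - a rough sketch 
(organisation, module and dependency values are placeholders):

<ivy-module version="1.0">
  <info organisation="org.apache.nutch" module="${ant.project.name}"/>
  <dependencies>
    <!-- only the libraries this plugin actually imports -->
    <dependency org="some.org" name="some-lib" rev="1.0" conf="*->default"/>
  </dependencies>
</ivy-module>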


--
Best regards,
Andrzej Bialecki 
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: Signature == null ?

2011-11-15 Thread Andrzej Bialecki

On 15/11/2011 20:33, Markus Jelsma wrote:

It's back again! Last try if someone has a pointer for this.
Cheers


After some DB updates, they're gone! Anyone recognizes this phenomenon?

On Tuesday 08 November 2011 11:22:48 Markus Jelsma wrote:

On Tuesday 08 November 2011 11:15:37 Markus Jelsma wrote:

Hi guys,

I've a M/R job selecting only DB_FETCHED and DB_NOTMODIFIED records and
their signatures. I had to add a sanity check on the signature to avoid an
NPE. I had assumed any record with such a DB_ status has to have a
signature, right?

Why does roughly 0.0001625% of my records exist without a signature?


Now with correct metrics:
Why does roughly 0.84% of my records exist without a signature?


This could be somehow related to pages that come from redirects so that 
when they are fetched they are accounted for under different urls, which 
in turn may confuse the update code in CrawlDbReducer... Do you notice 
any pattern to these pages? What's their origin?


--
Best regards,
Andrzej Bialecki 
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: Dependency Injection

2011-11-22 Thread Andrzej Bialecki

On 22/11/2011 19:47, PJ Herring wrote:

Hey Chris,

Thanks for the response. I looked at the documents you sent me, and I
really do think incorporating some kind of DI Framework could be a great
addition to Nutch.

I have a general plan of attack, but I'll try to write that up more
formally and send it out to get some kind of feedback.


This sounds interesting. As Chris mentioned, the current plugin system 
is far from ideal, but so far it worked reasonably well. The key 
functionality that it implements is:


* self-discovery of services provided by each plugin,
* easy pluggability, by virtue of dropping super-jars (jars with 
impl. classes and nested library jars) into a predefined location,
* controlled classloader isolation between plugins so that incompatible 
versions of libraries can be used
* but also the ability to export specified classes and libraries so that one 
plugin can use another plugin's exported resources on its classpath.

* optional auto-loading of dependent plugins
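
To make the comparison concrete, this is roughly what a plugin descriptor 
expresses today - a sketch only, with placeholder ids, jar and class names:

<plugin id="urlfilter-example" name="Example URL Filter" version="1.0.0"
        provider-name="example.org">
  <runtime>
    <library name="urlfilter-example.jar">
      <export name="*"/>                     <!-- classes visible to other plugins -->
    </library>
    <library name="some-third-party.jar"/>   <!-- nested library jar, isolated per plugin -->
  </runtime>
  <requires>
    <import plugin="nutch-extensionpoints"/> <!-- dependent plugin, loaded automatically -->
  </requires>
  <extension id="org.example.urlfilter" point="org.apache.nutch.net.URLFilter">
    <implementation id="ExampleURLFilter" class="org.example.ExampleURLFilter"/>
  </extension>
</plugin>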

In the past one contributor made a bold attempt to port Nutch to OSGI, 
and it turned out to be much more complicated than we expected, and with 
a bigger impact on the way Nutch applications were supposed to run ... 
so at that time we didn't think this complication was justified.


If we can figure out something between full-blown OSGI and the current 
system then that would be great.


--
Best regards,
Andrzej Bialecki 
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: Dependency Injection

2011-11-23 Thread Andrzej Bialecki

On 23/11/2011 01:02, Andrzej Bialecki wrote:

On 22/11/2011 19:47, PJ Herring wrote:

Hey Chris,

Thanks for the response. I looked at the documents you sent me, and I
really do think incorporating some kind of DI Framework could be a great
addition to Nutch.

I have a general plan of attack, but I'll try to write that up more
formally and send it out to get some kind of feedback.


This sounds interesting. As Chris mentioned, the current plugin system
is far from ideal, but so far it worked reasonably well. The key
functionality that it implements is:

* self-discovery of services provided by each plugin,
* easy pluggability, by virtue of dropping super-jars (jars with
impl. classes and nested library jars) into a predefined location,
* controlled classloader isolation between plugins so that incompatible
versions of libraries can be used
* but also the ability to export specified classes and libraries so that one
plugin can use another plugin's exported resources on its classpath.
* optional auto-loading of dependent plugins

In the past one contributor made a bold attempt to port Nutch to OSGI,
and it turned out to be much more complicated than we expected, and with
a bigger impact on the way Nutch applications were supposed to run ...
so at that time we didn't think this complication was justified.

If we can figure out something between full-blown OSGI and the current
system then that would be great.



You may also want to take a look at JSPF (http://code.google.com/p/jspf) 
which perhaps could be made to satisfy the above requirements without 
too much refactoring.



--
Best regards,
Andrzej Bialecki 
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: Upgrading to Hadoop 0.22.0+

2011-12-13 Thread Andrzej Bialecki

On 13/12/2011 17:42, Lewis John Mcgibbney wrote:

Hi Markus,

I'm certainly in agreement here. If you like to open a Jira, we can
begin the build up a picture of what is required.

Lewis

On Tue, Dec 13, 2011 at 4:41 PM, Markus Jelsma
markus.jel...@openindex.io  wrote:

Hi,

To keep up with the rest of the world I believe we should move from the old
Hadoop mapred API to the new MapReduce API, which has already been done for
the nutchgora branch. Upgrading from hadoop-core to hadoop-common is easily
done in Ivy but all jobs must be tackled and we have many jobs!

Anyone to give pointers and helping hand in this large task?


I guess the question is also whether 0.22 is compatible enough that the 
existing code using the old api more or less compiles. If it 
does, then we can do the transition gradually; if it doesn't, then it's a 
bigger issue.


This is easy to verify - just drop in the 0.22 jars and see if it 
compiles / tests are passing.


--
Best regards,
Andrzej Bialecki 
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: Upgrading to Hadoop 0.22.0+

2011-12-13 Thread Andrzej Bialecki

On 13/12/2011 18:04, Markus Jelsma wrote:

Hi

I did a quick test to see what happens and it won't compile. It cannot find
our old mapred API's in 0.22. I've also tried 0.20.205.0 which compiles but
won't run and many tests fail with stuff like:

Exception in thread main java.lang.NoClassDefFoundError:
org/codehaus/jackson/map/JsonMappingException
 at
org.apache.nutch.util.dupedb.HostDeduplicator.deduplicator(HostDeduplicator.java:421)


Hmm... what's that? I don't see this class (or this package) in the 
Nutch tree. Also, trunk doesn't use JSON for anything as far as I know.



 at
org.apache.nutch.util.dupedb.HostDeduplicator.run(HostDeduplicator.java:443)
 at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
 at
org.apache.nutch.util.dupedb.HostDeduplicator.main(HostDeduplicator.java:431)
Caused by: java.lang.ClassNotFoundException:
org.codehaus.jackson.map.JsonMappingException
 at java.net.URLClassLoader$1.run(URLClassLoader.java:202)
 at java.security.AccessController.doPrivileged(Native Method)
 at java.net.URLClassLoader.findClass(URLClassLoader.java:190)
 at java.lang.ClassLoader.loadClass(ClassLoader.java:306)
 at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:301)
 at java.lang.ClassLoader.loadClass(ClassLoader.java:247)
 ... 4 more

I think this can be overcome but we cannot hide from the fact that all jobs
must be ported to the new API at some point.

You did some work on the new API's, did you come across any cumbersome issues
when working on it?


It was quite some time ago .. but I don't remember anything being really 
complicated, it was just tedious - and once you've done one class the 
other classes follow roughly the same pattern.



--
Best regards,
Andrzej Bialecki 
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: [jira] [Created] (NUTCH-1225) Migrate CrawlDBScanner to MapReduce API

2011-12-14 Thread Andrzej Bialecki

On 14/12/2011 16:01, Markus Jelsma wrote:

This is highly annoying, MapFileOutputFormat is not present in the MapReduce
API until 0.21!


AFAIK that's not the case ... there is both an old api and a new api 
implementation (the old one is deprecated). The new api is in 
org.apache.hadoop.mapreduce.lib.output .


--
Best regards,
Andrzej Bialecki 
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: [jira] [Created] (NUTCH-1225) Migrate CrawlDBScanner to MapReduce API

2011-12-14 Thread Andrzej Bialecki

On 14/12/2011 18:30, Markus Jelsma wrote:

proper link:

http://hadoop.apache.org/common/docs/r0.20.205.0/api/org/apache/hadoop/mapreduce/lib/output/package-summary.html


I thought the goal was to upgrade to 0.22, where this class is present. 
In 0.20.205 org.apache.hadoop.mapred.MapFileOutputFormat still uses the 
old api, and it's not deprecated yet.



--
Best regards,
Andrzej Bialecki 
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: [jira] [Created] (NUTCH-1225) Migrate CrawlDBScanner to MapReduce API

2011-12-15 Thread Andrzej Bialecki

On 15/12/2011 13:13, Markus Jelsma wrote:

hmm, I don't see how I can use the old mapred MapFileOutputFormat API with the new
Job API. job.setOutputFormatClass(MapFileOutputFormat.class) expects the
mapreduce.lib.output.MapFileOutputFormat class and won't accept the old API.

setOutputFormatClass(java.lang.Class<? extends
org.apache.hadoop.mapreduce.OutputFormat>) in org.apache.hadoop.mapreduce.Job
cannot be applied to
(java.lang.Class<org.apache.hadoop.mapred.MapFileOutputFormat>)

In short, i don't know how i can migrate jobs to the new API on 0.20.x without
having MapFileOutputFormat present in the new API. Trying to set to old
mapoutputformat


Ah, no, that's not what I meant ... of course you need to change the 
code to use the new api, and the new code will look quite different :) 
my point was only that it is different in a consistent way, so after 
you've ported one or two classes the other ones are easy to convert, too...


I'm bogged with other work now, but I'll see if I can prepare an example 
later today...


--
Best regards,
Andrzej Bialecki 
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: Build failed in Jenkins: Nutch-trunk #1706

2011-12-28 Thread Andrzej Bialecki

On 28/12/2011 12:00, Lewis John Mcgibbney wrote:

Hi Guys,

Pretty strange compilation failure, this test class hasn't been hacked
in months, and from the surface, having looked at the test case there
appears to be no obvious reasons for it failing to compile. I've kick
started another build on Jenkins to see if it will resolve itself.


I don't think it will - I can reproduce this failure locally. Here's 
what fixed the failure for me (I'm pretty ignorant about ivy/maven so 
there's likely a more correct fix for this):


Index: ivy/ivy.xml
===================================================================
--- ivy/ivy.xml (revision 1225046)
+++ ivy/ivy.xml (working copy)
@@ -69,7 +69,7 @@
   <!-- Configuration: test -->

   <!-- artifacts needed for testing -->
-  <dependency org="junit" name="junit" rev="3.8.1" conf="test->default" />
+  <dependency org="junit" name="junit" rev="3.8.1" conf="*->default" />
   <dependency org="org.apache.hadoop" name="hadoop-test" rev="0.20.205.0" conf="test->default" />


--
Best regards,
Andrzej Bialecki 
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: Build failed in Jenkins: Nutch-trunk #1706

2011-12-28 Thread Andrzej Bialecki

On 28/12/2011 14:15, Lewis John Mcgibbney wrote:

Hi Andrzej,

Can anyone confirm? I've tried this patch locally and although I
couldn't reproduce the original issue, it seems to be working fine for
me as well.


Check your lib/ dir, maybe you have a local copy of junit jar that gets 
pulled onto the classpath and masks the issue? This happened to me once or 
twice...



--
Best regards,
Andrzej Bialecki 
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: Drawing an analogy between AdaptiveFetchSchedule and AdaptiveCrawlDelay

2012-03-02 Thread Andrzej Bialecki

On 02/03/2012 12:45, Lewis John Mcgibbney wrote:

Hi Guys,

As there were some comments on the user list, I recently got digging
into http redirects and then stumbled across NUTCH-1042. Although these are
individual issues e.g. redirects and crawl delays, I think they are
certainly linked, however what is interesting is that users 'usually'
don't consider them to be interlinked as such and therefore struggle to
debug how and why either the redirect or the crawl delay pages are not
being fetched.

Doing some more digging I found the now rather old and tatty NUTCH-475,
which obviously got me thinking about how we maintain the
AdaptiveFetchSchedule for custom refetching. Now I've started
thinking about the following:

- Regardless of whether we implement an AdaptiveCrawlDelay, NUTCH-1042
still needs to be fixed, as this is obviously becoming a bit of a pain for some
users.


Yes.


- Can someone shine some light on what happened to Fetcher2.java that
Dogacan refers to? I was only ever accustomed to OldFetcher and Fetcher :0)


Fetcher2 is the current Fetcher. The original Fetcher was temporarily 
renamed OldFetcher and then removed.


--
Best regards,
Andrzej Bialecki 
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: question about ObjectCache

2012-04-10 Thread Andrzej Bialecki

On 10/04/2012 05:00, Xiaolong Yang wrote:

Hi all,

I'm reading the Nutch source code and I'm puzzled by ObjectCache.java in the
org.apache.nutch.util package. It seems to be of little benefit in the URL
normalizers and URL filters. I have also read some discussion about caching
in NUTCH-169 and NUTCH-501, but I still don't understand it.

Can anyone tell me where ObjectCache is used and where it provides a real
benefit in Nutch?


ObjectCache is designed to cache ready-to-use instances of Nutch 
plugins. The process of finding, instantiating and initializing plugins 
is inefficient, because it involves parsing plugin descriptors, 
initializing plugins, collecting the ones that implement correct 
extension points, etc.


It would kill performance if this process were invoked each time you 
want to run all plugins of a given type (e.g. URLNormalizer-s). The 
facade URLNormalizers/URLFilters and others make sure that plugin 
instances of a given type are initialized once per lifetime of a JVM, 
and then they are cached in ObjectCache, so that next time you want to 
use them they can be retrieved from a cache, instead of going again 
through the process of parsing/instantiating/initializing.
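
In pseudo-code (simplified - the real facade code differs, and loadAndInit() here 
just stands for the expensive descriptor-parsing path) the pattern is:

ObjectCache objectCache = ObjectCache.get(conf);
URLNormalizer[] normalizers =
    (URLNormalizer[]) objectCache.getObject("urlnormalizers-default");
if (normalizers == null) {
  // first use in this JVM for this configuration: parse the plugin
  // descriptors, find the URLNormalizer extensions, instantiate and
  // configure them - the expensive part
  normalizers = loadAndInit(conf);   // hypothetical helper
  objectCache.setObject("urlnormalizers-default", normalizers);
}
// every later call gets the cached, ready-to-use instances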


--
Best regards,
Andrzej Bialecki 
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



[jira] Commented: (NUTCH-650) Hbase Integration

2010-06-29 Thread Andrzej Bialecki (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-650?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12883559#action_12883559
 ] 

Andrzej Bialecki  commented on NUTCH-650:
-

So far as one can digest such a giant patch ;) I think this is ok, at least 
from the legal POV it clarifies the situation and it doesn't bring any 
dependencies with incompatible licenses. As for the content itself, we'll need 
to resolve this incrementally, as discussed on the list.

So, a cautious +1 from me to apply this on branches/nutchbase.

 Hbase Integration
 -

 Key: NUTCH-650
 URL: https://issues.apache.org/jira/browse/NUTCH-650
 Project: Nutch
  Issue Type: New Feature
Reporter: Doğacan Güney
Assignee: Doğacan Güney
 Fix For: 2.0

 Attachments: hbase-integration_v1.patch, hbase_v2.patch, 
 latest-nutchbase-vs-original-branch-point.patch, 
 latest-nutchbase-vs-svn-nutchbase.patch, malformedurl.patch, meta.patch, 
 meta2.patch, nofollow-hbase.patch, NUTCH-650.patch, nutch-habase.patch, 
 searching.diff, slash.patch


 This issue will track nutch/hbase integration

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Assigned: (NUTCH-837) Remove search servers and Lucene dependencies

2010-07-01 Thread Andrzej Bialecki (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-837?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrzej Bialecki  reassigned NUTCH-837:
---

Assignee: Andrzej Bialecki 

 Remove search servers and Lucene dependencies 
 --

 Key: NUTCH-837
 URL: https://issues.apache.org/jira/browse/NUTCH-837
 Project: Nutch
  Issue Type: Task
  Components: searcher, web gui
Affects Versions: 1.1
Reporter: Julien Nioche
Assignee: Andrzej Bialecki 
 Fix For: 2.0


 One of the main aspects of 2.0 is the delegation of the indexing and search 
 to external resources like SOLR. We can simplify the code a lot by getting 
 rid of the : 
 * search servers
 * indexing and analysis with Lucene
 * search side functionalities : ontologies / clustering etc...
 In the short term only SOLR / SOLRCloud will be supported but the plan would 
 be to add other systems as well. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (NUTCH-837) Remove search servers and Lucene dependencies

2010-07-02 Thread Andrzej Bialecki (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-837?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrzej Bialecki  updated NUTCH-837:


Attachment: NUTCH-837.patch

Updated patch against r959954 (after NUTCH-836).

 Remove search servers and Lucene dependencies 
 --

 Key: NUTCH-837
 URL: https://issues.apache.org/jira/browse/NUTCH-837
 Project: Nutch
  Issue Type: Task
  Components: searcher, web gui
Affects Versions: 1.1
Reporter: Julien Nioche
Assignee: Andrzej Bialecki 
 Fix For: 2.0

 Attachments: NUTCH-837.patch, NUTCH-837.patch


 One of the main aspects of 2.0 is the delegation of the indexing and search 
 to external resources like SOLR. We can simplify the code a lot by getting 
 rid of the : 
 * search servers
 * indexing and analysis with Lucene
 * search side functionalities : ontologies / clustering etc...
 In the short term only SOLR / SOLRCloud will be supported but the plan would 
 be to add other systems as well. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (NUTCH-837) Remove search servers and Lucene dependencies

2010-07-02 Thread Andrzej Bialecki (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-837?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrzej Bialecki  updated NUTCH-837:


Attachment: (was: NUTCH-837.patch)

 Remove search servers and Lucene dependencies 
 --

 Key: NUTCH-837
 URL: https://issues.apache.org/jira/browse/NUTCH-837
 Project: Nutch
  Issue Type: Task
  Components: searcher, web gui
Affects Versions: 1.1
Reporter: Julien Nioche
Assignee: Andrzej Bialecki 
 Fix For: 2.0

 Attachments: NUTCH-837.patch


 One of the main aspects of 2.0 is the delegation of the indexing and search 
 to external resources like SOLR. We can simplify the code a lot by getting 
 rid of the : 
 * search servers
 * indexing and analysis with Lucene
 * search side functionalities : ontologies / clustering etc...
 In the short term only SOLR / SOLRCloud will be supported but the plan would 
 be to add other systems as well. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (NUTCH-837) Remove search servers and Lucene dependencies

2010-07-02 Thread Andrzej Bialecki (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-837?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12884729#action_12884729
 ] 

Andrzej Bialecki  commented on NUTCH-837:
-

bq. So, I think we should still have a Nutch webapp and in my mind it's a 
must-have for a 2.0 release...

I agree. But for the moment it's better to delete the old webapp stuff that we 
know for sure doesn't work with the current Nutch, and it will be completely 
reimplemented anyway.

 Remove search servers and Lucene dependencies 
 --

 Key: NUTCH-837
 URL: https://issues.apache.org/jira/browse/NUTCH-837
 Project: Nutch
  Issue Type: Task
  Components: searcher, web gui
Affects Versions: 1.1
Reporter: Julien Nioche
Assignee: Andrzej Bialecki 
 Fix For: 2.0

 Attachments: NUTCH-837.patch


 One of the main aspects of 2.0 is the delegation of the indexing and search 
 to external resources like SOLR. We can simplify the code a lot by getting 
 rid of the : 
 * search servers
 * indexing and analysis with Lucene
 * search side functionalities : ontologies / clustering etc...
 In the short term only SOLR / SOLRCloud will be supported but the plan would 
 be to add other systems as well. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Resolved: (NUTCH-837) Remove search servers and Lucene dependencies

2010-07-02 Thread Andrzej Bialecki (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-837?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrzej Bialecki  resolved NUTCH-837.
-

Resolution: Fixed

Committed in r960064. Thanks for review!

 Remove search servers and Lucene dependencies 
 --

 Key: NUTCH-837
 URL: https://issues.apache.org/jira/browse/NUTCH-837
 Project: Nutch
  Issue Type: Task
  Components: searcher, web gui
Affects Versions: 1.1
Reporter: Julien Nioche
Assignee: Andrzej Bialecki 
 Fix For: 2.0

 Attachments: NUTCH-837.patch


 One of the main aspects of 2.0 is the delegation of the indexing and search 
 to external resources like SOLR. We can simplify the code a lot by getting 
 rid of the : 
 * search servers
 * indexing and analysis with Lucene
 * search side functionalities : ontologies / clustering etc...
 In the short term only SOLR / SOLRCloud will be supported but the plan would 
 be to add other systems as well. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (NUTCH-821) Use ivy in nutch builds

2010-07-05 Thread Andrzej Bialecki (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-821?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12885188#action_12885188
 ] 

Andrzej Bialecki  commented on NUTCH-821:
-

I think this patch refers to some parts that were already removed in NUTCH-837 
...

Also, it would be nice to have a target that sets up an Eclipse project - after 
this patch is applied the lib/ dir is nearly empty and you need to run the build at 
least once to pull in the dependencies - this may be confusing.

 Use ivy in nutch builds
 ---

 Key: NUTCH-821
 URL: https://issues.apache.org/jira/browse/NUTCH-821
 Project: Nutch
  Issue Type: New Feature
  Components: build
Affects Versions: 2.0
Reporter: Enis Soztutar
Assignee: Enis Soztutar
 Fix For: 2.0

 Attachments: NUTCH-821.patch, nutchbase-ivy_v1.patch


 Ivy is the de-facto dependency management tool used in conjunction with Ant. 
 It would be nice if we switch to using Ivy in Nutch builds. 
 Maven is also an alternative, but I think Nutch will benefit more with an 
 Ant+Ivy architecture. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (NUTCH-696) Timeout for Parser

2010-07-05 Thread Andrzej Bialecki (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-696?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrzej Bialecki  updated NUTCH-696:


Attachment: timeout.patch

A simple patch that implements the strategy outlined here http://bit.ly/bdTYrS 
- I've been recently suffering from this issue, so this is better than nothing. 
Julien's strategy would work, too, but then the job takes much longer to 
execute.
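
The idea in a nutshell (just a sketch using java.util.concurrent - the attached 
patch differs in details and the property name is only illustrative): run the 
parse call in a separate thread and give up if it doesn't return in time:

{code}
ExecutorService executor = Executors.newSingleThreadExecutor();
Future<ParseResult> task = executor.submit(new Callable<ParseResult>() {
  public ParseResult call() throws Exception {
    return parser.getParse(content);   // the call that may hang
  }
});
try {
  ParseResult res = task.get(conf.getInt("parser.timeout", 30), TimeUnit.SECONDS);
  // use res ...
} catch (TimeoutException e) {
  task.cancel(true);                   // record a failed parse and move on
} catch (Exception e) {
  // other parse failures are handled as before
}
{code}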

 Timeout for Parser
 --

 Key: NUTCH-696
 URL: https://issues.apache.org/jira/browse/NUTCH-696
 Project: Nutch
  Issue Type: Wish
  Components: fetcher
Reporter: Julien Nioche
Priority: Minor
 Attachments: timeout.patch


 I found that the parsing sometimes crashes due to a problem on a specific 
 document, which is a bit of a shame as this blocks the rest of the segment 
 and Hadoop ends up finding that the node does not respond. I was wondering 
 about whether it would make sense to have a timeout mechanism for the parsing 
 so that if a document is not parsed after a time t, it is simply treated as 
 an exception and we can get on with the rest of the process.
 Does that make sense? Where do you think we should implement that, in 
 ParseUtil?

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (NUTCH-696) Timeout for Parser

2010-07-05 Thread Andrzej Bialecki (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-696?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12885257#action_12885257
 ] 

Andrzej Bialecki  commented on NUTCH-696:
-

Yes - this patch is a quick solution that allowed me to complete a crawl. If 
people feel this is useful, let's polish it.

 Timeout for Parser
 --

 Key: NUTCH-696
 URL: https://issues.apache.org/jira/browse/NUTCH-696
 Project: Nutch
  Issue Type: Wish
  Components: fetcher
Reporter: Julien Nioche
Priority: Minor
 Attachments: timeout.patch


 I found that the parsing sometimes crashes due to a problem on a specific 
 document, which is a bit of a shame as this blocks the rest of the segment 
 and Hadoop ends up finding that the node does not respond. I was wondering 
 about whether it would make sense to have a timeout mechanism for the parsing 
 so that if a document is not parsed after a time t, it is simply treated as 
 an exception and we can get on with the rest of the process.
 Does that make sense? Where do you think we should implement that, in 
 ParseUtil?

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Reopened: (NUTCH-696) Timeout for Parser

2010-07-05 Thread Andrzej Bialecki (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-696?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrzej Bialecki  reopened NUTCH-696:
-


This may be useful after all - let's gather more comments.

 Timeout for Parser
 --

 Key: NUTCH-696
 URL: https://issues.apache.org/jira/browse/NUTCH-696
 Project: Nutch
  Issue Type: Wish
  Components: fetcher
Reporter: Julien Nioche
Priority: Minor
 Attachments: timeout.patch


 I found that the parsing sometimes crashes due to a problem on a specific 
 document, which is a bit of a shame as this blocks the rest of the segment 
 and Hadoop ends up finding that the node does not respond. I was wondering 
 about whether it would make sense to have a timeout mechanism for the parsing 
 so that if a document is not parsed after a time t, it is simply treated as 
 an exception and we can get on with the rest of the process.
 Does that make sense? Where do you think we should implement that, in 
 ParseUtil?

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (NUTCH-696) Timeout for Parser

2010-07-05 Thread Andrzej Bialecki (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-696?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12885295#action_12885295
 ] 

Andrzej Bialecki  commented on NUTCH-696:
-

I agree, ultimately that's the way to go. However, I needed something _now_, 
and the patch helps to solve the problem that I have now - and until this 
problem is solved in Tika this patch provides some kind of band-aid for us poor 
Nutch-ers...

 Timeout for Parser
 --

 Key: NUTCH-696
 URL: https://issues.apache.org/jira/browse/NUTCH-696
 Project: Nutch
  Issue Type: Wish
  Components: fetcher
Reporter: Julien Nioche
Priority: Minor
 Attachments: timeout.patch


 I found that the parsing sometimes crashes due to a problem on a specific 
 document, which is a bit of a shame as this blocks the rest of the segment 
 and Hadoop ends up finding that the node does not respond. I was wondering 
 about whether it would make sense to have a timeout mechanism for the parsing 
 so that if a document is not parsed after a time t, it is simply treated as 
 an exception and we can get on with the rest of the process.
 Does that make sense? Where do you think we should implement that, in 
 ParseUtil?

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (NUTCH-821) Use ivy in nutch builds

2010-07-06 Thread Andrzej Bialecki (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-821?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12885583#action_12885583
 ] 

Andrzej Bialecki  commented on NUTCH-821:
-

+1 for this patch for now - all good comments, there's plenty of improvements 
we can make, so let's line them up as separate issues.

 Use ivy in nutch builds
 ---

 Key: NUTCH-821
 URL: https://issues.apache.org/jira/browse/NUTCH-821
 Project: Nutch
  Issue Type: New Feature
  Components: build
Affects Versions: 2.0
Reporter: Enis Soztutar
Assignee: Enis Soztutar
 Fix For: 2.0

 Attachments: NUTCH-821.patch, nutchbase-ivy_v1.patch


 Ivy is the de-facto dependency management tool used in conjunction with Ant. 
 It would be nice if we switch to using Ivy in Nutch builds. 
 Maven is also an alternative, but I think Nutch will benefit more with an 
 Ant+Ivy architecture. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Created: (NUTCH-843) Separate the build and runtime environments

2010-07-07 Thread Andrzej Bialecki (JIRA)
Separate the build and runtime environments
---

 Key: NUTCH-843
 URL: https://issues.apache.org/jira/browse/NUTCH-843
 Project: Nutch
  Issue Type: Improvement
  Components: build
Affects Versions: 2.0
Reporter: Andrzej Bialecki 
Assignee: Andrzej Bialecki 


Currently there is no clean separation of source, build and runtime artifacts. 
On one hand, it makes it easier to get started in local mode, but on the other 
hand it makes the distributed (or pseudo-distributed) setup much more 
challenging and tricky. Also, some resources (config files and classes) are 
included several times on the classpath, they are loaded under different 
classloaders, and in the end it's not obvious which copy takes precedence, and why.

Here's an example of a harmful unintended behavior caused by this mess: Hadoop 
daemons (jobtracker and tasktracker) will get conf/ and build/ on their 
classpath. This means that a task running on this cluster will have two copies 
of resources from these locations - one from the inherited classpath from 
tasktracker, and the other one from the just unpacked nutch.job file. If these 
two versions differ, only the first one will be loaded, which in this case is 
the one taken from the (unpacked) conf/ and build/ - the other one, from within 
the nutch.job file, will be ignored.

It's even worse when you add more nodes to the cluster - the nutch.job will be 
shipped to the new nodes as a part of each task setup, but now the remote 
tasktracker child processes will use resources from nutch.job - so some tasks 
will use different versions of resources than other tasks. This usually leads 
to a host of very difficult to debug issues.

This issue proposes then to separate these environments into the following 
areas:

* source area - i.e. our current sources. Note that bin/ scripts will belong to 
this category too, so there will be no top-level bin/. nutch-default.xml 
belongs to this category too. Other customizable files can be moved to src/conf 
too, or they could stay in top-level conf/ as today, with a README that 
explains that changes made there take effect only after you rebuild the job jar.

* build area - contains build artifacts, among them the nutch.job jar.

* runtime (or deploy) area - this area contains all artifacts needed to run 
Nutch jobs. For a distributed setup that uses an existing Hadoop cluster 
(installed from plain vanilla Hadoop release) this will be a {{/deploy}} 
directory, where we put the following:
{code}
bin/nutch
nutch.job
{code}
That's it - nothing else should be needed, because all other resources are 
already included in the job jar. These resources can be copied directly to the 
master Hadoop node.

For a local setup (using LocalJobTracker) this will be a {{/runtime}} 
directory, where we put the following:
{code}
bin/nutch
lib/hadoop-libs
plugins/
nutch.job
{code}
Due to limitations in the PluginClassLoader the local runtime requires that the 
plugins/ directory be unpacked from the job jar. And we need the hadoop libs to 
run in the local mode. We may later on refine this local setup to something 
like this:
{code}
bin/nutch
conf/
lib/hadoop-libs
lib/nutch-libs
plugins/
nutch.jar
{code}
so that it's easier to modify the config without rebuilding the job jar (which 
actually would not be used in this case).

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (NUTCH-843) Separate the build and runtime environments

2010-07-07 Thread Andrzej Bialecki (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-843?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrzej Bialecki  updated NUTCH-843:


Attachment: NUTCH-843.patch

This patch moves bin/nutch to src/bin/nutch, and creates /runtime/deploy and 
/runtime/local areas, populated with the right pieces. bin/nutch has been 
modified to work correctly in both cases.

 Separate the build and runtime environments
 ---

 Key: NUTCH-843
 URL: https://issues.apache.org/jira/browse/NUTCH-843
 Project: Nutch
  Issue Type: Improvement
  Components: build
Affects Versions: 2.0
Reporter: Andrzej Bialecki 
Assignee: Andrzej Bialecki 
 Attachments: NUTCH-843.patch


 Currently there is no clean separation of source, build and runtime 
 artifacts. On one hand, it makes it easier to get started in local mode, but 
 on the other hand it makes the distributed (or pseudo-distributed) setup much 
 more challenging and tricky. Also, some resources (config files and classes) 
 are included several times on the classpath, they are loaded under different 
 classloaders, and in the end it's not obvious what copy and why takes 
 precedence.
 Here's an example of a harmful unintended behavior caused by this mess: 
 Hadoop daemons (jobtracker and tasktracker) will get conf/ and build/ on 
 their classpath. This means that a task running on this cluster will have two 
 copies of resources from these locations - one from the inherited classpath 
 from tasktracker, and the other one from the just unpacked nutch.job file. If 
 these two versions differ, only the first one will be loaded, which in this 
 case is the one taken from the (unpacked) conf/ and build/ - the other one, 
 from within the nutch.job file, will be ignored.
 It's even worse when you add more nodes to the cluster - the nutch.job will 
 be shipped to the new nodes as a part of each task setup, but now the remote 
 tasktracker child processes will use resources from nutch.job - so some tasks 
 will use different versions of resources than other tasks. This usually leads 
 to a host of very difficult to debug issues.
 This issue proposes then to separate these environments into the following 
 areas:
 * source area - i.e. our current sources. Note that bin/ scripts will belong 
 to this category too, so there will be no top-level bin/. nutch-default.xml 
 belongs to this category too. Other customizable files can be moved to 
 src/conf too, or they could stay in top-level conf/ as today, with a README 
 that explains that changes made there take effect only after you rebuild the 
 job jar.
 * build area - contains build artifacts, among them the nutch.job jar.
 * runtime (or deploy) area - this area contains all artifacts needed to run 
 Nutch jobs. For a distributed setup that uses an existing Hadoop cluster 
 (installed from plain vanilla Hadoop release) this will be a {{/deploy}} 
 directory, where we put the following:
 {code}
 bin/nutch
 nutch.job
 {code}
 That's it - nothing else should be needed, because all other resources are 
 already included in the job jar. These resources can be copied directly to 
 the master Hadoop node.
 For a local setup (using LocalJobTracker) this will be a {{/runtime}} 
 directory, where we put the following:
 {code}
 bin/nutch
 lib/hadoop-libs
 plugins/
 nutch.job
 {code}
 Due to limitations in the PluginClassLoader the local runtime requires that 
 the plugins/ directory be unpacked from the job jar. And we need the hadoop 
 libs to run in the local mode. We may later on refine this local setup to 
 something like this:
 {code}
 bin/nutch
 conf/
 lib/hadoop-libs
 lib/nutch-libs
 plugins/
 nutch.jar
 {code}
 so that it's easier to modify the config without rebuilding the job jar 
 (which actually would not be used in this case).

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (NUTCH-843) Separate the build and runtime environments

2010-07-07 Thread Andrzej Bialecki (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-843?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12886015#action_12886015
 ] 

Andrzej Bialecki  commented on NUTCH-843:
-

We need to create the job file anyway. Actually, the patch I attached does 
something like this for the local setup (lib/ is flattened), but still I would 
argue for setting up two areas, /runtime/deploy and /runtime/local - it's 
painfully obvious then what parts you need to deploy to a Hadoop cluster.

 Separate the build and runtime environments
 ---

 Key: NUTCH-843
 URL: https://issues.apache.org/jira/browse/NUTCH-843
 Project: Nutch
  Issue Type: Improvement
  Components: build
Affects Versions: 2.0
Reporter: Andrzej Bialecki 
Assignee: Andrzej Bialecki 
 Attachments: NUTCH-843.patch


 Currently there is no clean separation of source, build and runtime 
 artifacts. On one hand, it makes it easier to get started in local mode, but 
 on the other hand it makes the distributed (or pseudo-distributed) setup much 
 more challenging and tricky. Also, some resources (config files and classes) 
 are included several times on the classpath, they are loaded under different 
 classloaders, and in the end it's not obvious what copy and why takes 
 precedence.
 Here's an example of a harmful unintended behavior caused by this mess: 
 Hadoop daemons (jobtracker and tasktracker) will get conf/ and build/ on 
 their classpath. This means that a task running on this cluster will have two 
 copies of resources from these locations - one from the inherited classpath 
 from tasktracker, and the other one from the just unpacked nutch.job file. If 
 these two versions differ, only the first one will be loaded, which in this 
 case is the one taken from the (unpacked) conf/ and build/ - the other one, 
 from within the nutch.job file, will be ignored.
 It's even worse when you add more nodes to the cluster - the nutch.job will 
 be shipped to the new nodes as a part of each task setup, but now the remote 
 tasktracker child processes will use resources from nutch.job - so some tasks 
 will use different versions of resources than other tasks. This usually leads 
 to a host of very difficult to debug issues.
 This issue proposes then to separate these environments into the following 
 areas:
 * source area - i.e. our current sources. Note that bin/ scripts will belong 
 to this category too, so there will be no top-level bin/. nutch-default.xml 
 belongs to this category too. Other customizable files can be moved to 
 src/conf too, or they could stay in top-level conf/ as today, with a README 
 that explains that changes made there take effect only after you rebuild the 
 job jar.
 * build area - contains build artifacts, among them the nutch.job jar.
 * runtime (or deploy) area - this area contains all artifacts needed to run 
 Nutch jobs. For a distributed setup that uses an existing Hadoop cluster 
 (installed from plain vanilla Hadoop release) this will be a {{/deploy}} 
 directory, where we put the following:
 {code}
 bin/nutch
 nutch.job
 {code}
 That's it - nothing else should be needed, because all other resources are 
 already included in the job jar. These resources can be copied directly to 
 the master Hadoop node.
 For a local setup (using LocalJobTracker) this will be a {{/runtime}} 
 directory, where we put the following:
 {code}
 bin/nutch
 lib/hadoop-libs
 plugins/
 nutch.job
 {code}
 Due to limitations in the PluginClassLoader the local runtime requires that 
 the plugins/ directory be unpacked from the job jar. And we need the hadoop 
 libs to run in the local mode. We may later on refine this local setup to 
 something like this:
 {code}
 bin/nutch
 conf/
 lib/hadoop-libs
 lib/nutch-libs
 plugins/
 nutch.jar
 {code}
 so that it's easier to modify the config without rebuilding the job jar 
 (which actually would not be used in this case).

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (NUTCH-843) Separate the build and runtime environments

2010-07-07 Thread Andrzej Bialecki (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-843?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrzej Bialecki  updated NUTCH-843:


Attachment: NUTCH-843.patch

Updated patch that moves nutch.jar to lib/ for the local runtime.

 Separate the build and runtime environments
 ---

 Key: NUTCH-843
 URL: https://issues.apache.org/jira/browse/NUTCH-843
 Project: Nutch
  Issue Type: Improvement
  Components: build
Affects Versions: 2.0
Reporter: Andrzej Bialecki 
Assignee: Andrzej Bialecki 
 Attachments: NUTCH-843.patch, NUTCH-843.patch


 Currently there is no clean separation of source, build and runtime 
 artifacts. On one hand, it makes it easier to get started in local mode, but 
 on the other hand it makes the distributed (or pseudo-distributed) setup much 
 more challenging and tricky. Also, some resources (config files and classes) 
 are included several times on the classpath, they are loaded under different 
 classloaders, and in the end it's not obvious what copy and why takes 
 precedence.
 Here's an example of a harmful unintended behavior caused by this mess: 
 Hadoop daemons (jobtracker and tasktracker) will get conf/ and build/ on 
 their classpath. This means that a task running on this cluster will have two 
 copies of resources from these locations - one from the inherited classpath 
 from tasktracker, and the other one from the just unpacked nutch.job file. If 
 these two versions differ, only the first one will be loaded, which in this 
 case is the one taken from the (unpacked) conf/ and build/ - the other one, 
 from within the nutch.job file, will be ignored.
 It's even worse when you add more nodes to the cluster - the nutch.job will 
 be shipped to the new nodes as a part of each task setup, but now the remote 
 tasktracker child processes will use resources from nutch.job - so some tasks 
 will use different versions of resources than other tasks. This usually leads 
 to a host of very difficult to debug issues.
 This issue proposes then to separate these environments into the following 
 areas:
 * source area - i.e. our current sources. Note that bin/ scripts will belong 
 to this category too, so there will be no top-level bin/. nutch-default.xml 
 belongs to this category too. Other customizable files can be moved to 
 src/conf too, or they could stay in top-level conf/ as today, with a README 
 that explains that changes made there take effect only after you rebuild the 
 job jar.
 * build area - contains build artifacts, among them the nutch.job jar.
 * runtime (or deploy) area - this area contains all artifacts needed to run 
 Nutch jobs. For a distributed setup that uses an existing Hadoop cluster 
 (installed from plain vanilla Hadoop release) this will be a {{/deploy}} 
 directory, where we put the following:
 {code}
 bin/nutch
 nutch.job
 {code}
 That's it - nothing else should be needed, because all other resources are 
 already included in the job jar. These resources can be copied directly to 
 the master Hadoop node.
 For a local setup (using LocalJobTracker) this will be a {{/runtime}} 
 directory, where we put the following:
 {code}
 bin/nutch
 lib/hadoop-libs
 plugins/
 nutch.job
 {code}
 Due to limitations in the PluginClassLoader the local runtime requires that 
 the plugins/ directory be unpacked from the job jar. And we need the hadoop 
 libs to run in the local mode. We may later on refine this local setup to 
 something like this:
 {code}
 bin/nutch
 conf/
 lib/hadoop-libs
 lib/nutch-libs
 plugins/
 nutch.jar
 {code}
 so that it's easier to modify the config without rebuilding the job jar 
 (which actually would not be used in this case).

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (NUTCH-844) Improve NutchConfiguration

2010-07-08 Thread Andrzej Bialecki (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-844?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrzej Bialecki  updated NUTCH-844:


Attachment: conf.patch

 Improve NutchConfiguration
 --

 Key: NUTCH-844
 URL: https://issues.apache.org/jira/browse/NUTCH-844
 Project: Nutch
  Issue Type: Improvement
Affects Versions: 2.0
Reporter: Andrzej Bialecki 
Assignee: Andrzej Bialecki 
 Fix For: 2.0

 Attachments: conf.patch


 This patch cleans up NutchConfiguration from servlet dependency, and modifies 
 the API to allow bootstrapping via API from Properties. This is important for 
 use cases where Nutch is embedded in a larger application.
 Also, while I'm at it, remove the support for alternative crawl 
 configuration when running Crawl tool, which has always been a source of 
 confusion.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (NUTCH-843) Separate the build and runtime environments

2010-07-08 Thread Andrzej Bialecki (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-843?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12886318#action_12886318
 ] 

Andrzej Bialecki  commented on NUTCH-843:
-

runtime/local doesn't need Hadoop scripts - by definition it uses the local FS and 
the local job tracker, so Hadoop scripts are of no use. Native libs - see 
NUTCH-845.

 Separate the build and runtime environments
 ---

 Key: NUTCH-843
 URL: https://issues.apache.org/jira/browse/NUTCH-843
 Project: Nutch
  Issue Type: Improvement
  Components: build
Affects Versions: 2.0
Reporter: Andrzej Bialecki 
Assignee: Andrzej Bialecki 
 Fix For: 2.0

 Attachments: NUTCH-843.patch, NUTCH-843.patch


 Currently there is no clean separation of source, build and runtime 
 artifacts. On one hand, it makes it easier to get started in local mode, but 
 on the other hand it makes the distributed (or pseudo-distributed) setup much 
 more challenging and tricky. Also, some resources (config files and classes) 
 are included several times on the classpath, they are loaded under different 
 classloaders, and in the end it's not obvious what copy and why takes 
 precedence.
 Here's an example of a harmful unintended behavior caused by this mess: 
 Hadoop daemons (jobtracker and tasktracker) will get conf/ and build/ on 
 their classpath. This means that a task running on this cluster will have two 
 copies of resources from these locations - one from the inherited classpath 
 from tasktracker, and the other one from the just unpacked nutch.job file. If 
 these two versions differ, only the first one will be loaded, which in this 
 case is the one taken from the (unpacked) conf/ and build/ - the other one, 
 from within the nutch.job file, will be ignored.
 It's even worse when you add more nodes to the cluster - the nutch.job will 
 be shipped to the new nodes as a part of each task setup, but now the remote 
 tasktracker child processes will use resources from nutch.job - so some tasks 
 will use different versions of resources than other tasks. This usually leads 
 to a host of very difficult to debug issues.
 This issue proposes then to separate these environments into the following 
 areas:
 * source area - i.e. our current sources. Note that bin/ scripts will belong 
 to this category too, so there will be no top-level bin/. nutch-default.xml 
 belongs to this category too. Other customizable files can be moved to 
 src/conf too, or they could stay in top-level conf/ as today, with a README 
 that explains that changes made there take effect only after you rebuild the 
 job jar.
 * build area - contains build artifacts, among them the nutch.job jar.
 * runtime (or deploy) area - this area contains all artifacts needed to run 
 Nutch jobs. For a distributed setup that uses an existing Hadoop cluster 
 (installed from plain vanilla Hadoop release) this will be a {{/deploy}} 
 directory, where we put the following:
 {code}
 bin/nutch
 nutch.job
 {code}
 That's it - nothing else should be needed, because all other resources are 
 already included in the job jar. These resources can be copied directly to 
 the master Hadoop node.
 For a local setup (using LocalJobTracker) this will be a {{/runtime}} 
 directory, where we put the following:
 {code}
 bin/nutch
 lib/hadoop-libs
 plugins/
 nutch.job
 {code}
 Due to limitations in the PluginClassLoader the local runtime requires that 
 the plugins/ directory be unpacked from the job jar. And we need the hadoop 
 libs to run in the local mode. We may later on refine this local setup to 
 something like this:
 {code}
 bin/nutch
 conf/
 lib/hadoop-libs
 lib/nutch-libs
 plugins/
 nutch.jar
 {code}
 so that it's easier to modify the config without rebuilding the job jar 
 (which actually would not be used in this case).

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (NUTCH-843) Separate the build and runtime environments

2010-07-08 Thread Andrzej Bialecki (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-843?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12886330#action_12886330
 ] 

Andrzej Bialecki  commented on NUTCH-843:
-

Pseudo-distributed (i.e. a real JobTracker with a single TaskTracker) suffers 
from the same classpath issues that I described above, so even in such a case 
it's best to run jobs in a separate environment, using /runtime/deploy 
artifacts.

 Separate the build and runtime environments
 ---

 Key: NUTCH-843
 URL: https://issues.apache.org/jira/browse/NUTCH-843
 Project: Nutch
  Issue Type: Improvement
  Components: build
Affects Versions: 2.0
Reporter: Andrzej Bialecki 
Assignee: Andrzej Bialecki 
 Fix For: 2.0

 Attachments: NUTCH-843.patch, NUTCH-843.patch


 Currently there is no clean separation of source, build and runtime 
 artifacts. On one hand, it makes it easier to get started in local mode, but 
 on the other hand it makes the distributed (or pseudo-distributed) setup much 
 more challenging and tricky. Also, some resources (config files and classes) 
 are included several times on the classpath, they are loaded under different 
 classloaders, and in the end it's not obvious what copy and why takes 
 precedence.
 Here's an example of a harmful unintended behavior caused by this mess: 
 Hadoop daemons (jobtracker and tasktracker) will get conf/ and build/ on 
 their classpath. This means that a task running on this cluster will have two 
 copies of resources from these locations - one from the inherited classpath 
 from tasktracker, and the other one from the just unpacked nutch.job file. If 
 these two versions differ, only the first one will be loaded, which in this 
 case is the one taken from the (unpacked) conf/ and build/ - the other one, 
 from within the nutch.job file, will be ignored.
 It's even worse when you add more nodes to the cluster - the nutch.job will 
 be shipped to the new nodes as a part of each task setup, but now the remote 
 tasktracker child processes will use resources from nutch.job - so some tasks 
 will use different versions of resources than other tasks. This usually leads 
 to a host of very difficult to debug issues.
 This issue proposes then to separate these environments into the following 
 areas:
 * source area - i.e. our current sources. Note that bin/ scripts will belong 
 to this category too, so there will be no top-level bin/. nutch-default.xml 
 belongs to this category too. Other customizable files can be moved to 
 src/conf too, or they could stay in top-level conf/ as today, with a README 
 that explains that changes made there take effect only after you rebuild the 
 job jar.
 * build area - contains build artifacts, among them the nutch.job jar.
 * runtime (or deploy) area - this area contains all artifacts needed to run 
 Nutch jobs. For a distributed setup that uses an existing Hadoop cluster 
 (installed from plain vanilla Hadoop release) this will be a {{/deploy}} 
 directory, where we put the following:
 {code}
 bin/nutch
 nutch.job
 {code}
 That's it - nothing else should be needed, because all other resources are 
 already included in the job jar. These resources can be copied directly to 
 the master Hadoop node.
 For a local setup (using LocalJobTracker) this will be a {{/runtime}} 
 directory, where we put the following:
 {code}
 bin/nutch
 lib/hadoop-libs
 plugins/
 nutch.job
 {code}
 Due to limitations in the PluginClassLoader the local runtime requires that 
 the plugins/ directory be unpacked from the job jar. And we need the hadoop 
 libs to run in the local mode. We may later on refine this local setup to 
 something like this:
 {code}
 bin/nutch
 conf/
 lib/hadoop-libs
 lib/nutch-libs
 plugins/
 nutch.jar
 {code}
 so that it's easier to modify the config without rebuilding the job jar 
 (which actually would not be used in this case).

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Resolved: (NUTCH-845) Native hadoop libs not available through maven

2010-07-08 Thread Andrzej Bialecki (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-845?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrzej Bialecki  resolved NUTCH-845.
-

Fix Version/s: 2.0
   Resolution: Fixed

Committed in rev. 961778. Thanks for review!

 Native hadoop libs not available through maven
 --

 Key: NUTCH-845
 URL: https://issues.apache.org/jira/browse/NUTCH-845
 Project: Nutch
  Issue Type: Bug
  Components: build
Affects Versions: 2.0
Reporter: Andrzej Bialecki 
Assignee: Andrzej Bialecki 
 Fix For: 2.0


 There are no maven artifacts for the native libs (I verified this on Hadoop 
 ML). I think it's better to delete the libs, after all we don't want to keep 
 bits and pieces of dependencies in our svn, but let's leave a placeholder and 
 a README that explains how to get them.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (NUTCH-844) Improve NutchConfiguration

2010-07-14 Thread Andrzej Bialecki (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-844?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrzej Bialecki  updated NUTCH-844:


Attachment: NUTCH-844.patch

Updated patch. This also addresses an issue in PluginRepository that uses 
Configuration as a key in its internal cache - the problem though is that 
Configuration doesn't implement hashCode, so the cache would have been 
ineffective in situations like this:
{code}
Configuration conf = NutchConfiguration.create();
PluginRepository repo1 = PluginRepository.get(conf);
JobConf job = new NutchJob(conf);
PluginRepository repo2 = PluginRepository.get(job);
// repo2 is a new instance, but should be the same instance!
{code}

The new code sets a UUID property, so the cache knows it's still the same 
instance. There's a new unit test that ensures this works properly when using 
NutchConfiguration.create(), and illustrates that it fails without the uuid.
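
In rough terms (sketch only - the actual property name and cache wiring are in 
the patch):

{code}
// NutchConfiguration.create() stamps every new configuration with a unique id:
conf.set("nutch.conf.uuid", UUID.randomUUID().toString());

// PluginRepository.get(conf) then keys its cache on that id instead of on the
// Configuration instance, so a NutchJob copy maps to the same repository:
String uuid = conf.get("nutch.conf.uuid");
PluginRepository repo = CACHE.get(uuid);   // CACHE: a Map<String, PluginRepository>
if (repo == null) {
  repo = new PluginRepository(conf);
  CACHE.put(uuid, repo);
}
{code}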

 Improve NutchConfiguration
 --

 Key: NUTCH-844
 URL: https://issues.apache.org/jira/browse/NUTCH-844
 Project: Nutch
  Issue Type: Improvement
Affects Versions: 2.0
Reporter: Andrzej Bialecki 
Assignee: Andrzej Bialecki 
 Fix For: 2.0

 Attachments: conf.patch, NUTCH-844.patch


 This patch cleans up NutchConfiguration from servlet dependency, and modifies 
 the API to allow bootstrapping via API from Properties. This is important for 
 use cases where Nutch is embedded in a larger application.
 Also, while I'm at it, remove the support for alternative crawl 
 configuration when running Crawl tool, which has always been a source of 
 confusion.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Resolved: (NUTCH-844) Improve NutchConfiguration

2010-07-14 Thread Andrzej Bialecki (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-844?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrzej Bialecki  resolved NUTCH-844.
-

Resolution: Fixed

Committed in r964063. Thanks for review!

 Improve NutchConfiguration
 --

 Key: NUTCH-844
 URL: https://issues.apache.org/jira/browse/NUTCH-844
 Project: Nutch
  Issue Type: Improvement
Affects Versions: 2.0
Reporter: Andrzej Bialecki 
Assignee: Andrzej Bialecki 
 Fix For: 2.0

 Attachments: conf.patch, NUTCH-844.patch


 This patch cleans up NutchConfiguration from servlet dependency, and modifies 
 the API to allow bootstrapping via API from Properties. This is important for 
 use cases where Nutch is embedded in a larger application.
 Also, while I'm at it, remove the support for alternative crawl 
 configuration when running Crawl tool, which has always been a source of 
 confusion.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (NUTCH-858) No longer able to set per-field boosts on lucene documents

2010-07-21 Thread Andrzej Bialecki (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-858?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrzej Bialecki  updated NUTCH-858:


 Assignee: Andrzej Bialecki 
Fix Version/s: 1.2

 No longer able to set per-field boosts on lucene documents
 --

 Key: NUTCH-858
 URL: https://issues.apache.org/jira/browse/NUTCH-858
 Project: Nutch
  Issue Type: Bug
  Components: indexer
Affects Versions: 1.1
 Environment: n/a
Reporter: Edward Drapkin
Assignee: Andrzej Bialecki 
 Fix For: 1.2


 I'm working on upgrading from Nutch 0.9 to Nutch 1.1 and I've noticed that it 
 no longer seems possible to set boosts on specific fields in lucene 
 documents.  This is, in my opinion, a major feature regression and removes a 
 huge component to fine tuning search.  Can this be added?

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (NUTCH-858) No longer able to set per-field boosts on lucene documents

2010-07-21 Thread Andrzej Bialecki (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-858?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12890873#action_12890873
 ] 

Andrzej Bialecki  commented on NUTCH-858:
-

Unfortunately no. The patch was included in a fix to NUTCH-837, which is 
relative to trunk, and it's not directly applicable to 1.x, needs to be ported.

 No longer able to set per-field boosts on lucene documents
 --

 Key: NUTCH-858
 URL: https://issues.apache.org/jira/browse/NUTCH-858
 Project: Nutch
  Issue Type: Bug
  Components: indexer
Affects Versions: 1.1
 Environment: n/a
Reporter: Edward Drapkin
Assignee: Andrzej Bialecki 
 Fix For: 1.2


 I'm working on upgrading from Nutch 0.9 to Nutch 1.1 and I've noticed that it 
 no longer seems possible to set boosts on specific fields in lucene 
 documents.  This is, in my opinion, a major feature regression and removes a 
 huge component to fine tuning search.  Can this be added?

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Resolved: (NUTCH-863) Benchmark and a testbed proxy server

2010-07-30 Thread Andrzej Bialecki (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-863?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrzej Bialecki  resolved NUTCH-863.
-

Fix Version/s: 2.0
   Resolution: Fixed

Committed in rev. 980932.

 Benchmark and a testbed proxy server
 

 Key: NUTCH-863
 URL: https://issues.apache.org/jira/browse/NUTCH-863
 Project: Nutch
  Issue Type: New Feature
Affects Versions: 2.0
Reporter: Andrzej Bialecki 
Assignee: Andrzej Bialecki 
 Fix For: 2.0

 Attachments: proxy.patch


 This issue adds two components:
 * a testbed proxy server that can serve various content: pre-fetched Nutch 
 segments, forward requests to original URLs, or create a lot of unique but 
 predictable fake content (with outlinks) on the fly.
 * a simple Benchmark class to measure the time taken to complete several 
 crawl cycles using fake content.
 * 'ant proxy' and 'ant benchmark' targets to execute a benchmark run.
 Together these tools should provide a more or less objective method to 
 measure the end-to-end crawl performance. This initial version can be further 
 instrumented to collect statistics about various stages of data processing.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Created: (NUTCH-867) Port Nutch benchmark to Nutchbase

2010-07-31 Thread Andrzej Bialecki (JIRA)
Port Nutch benchmark to Nutchbase
-

 Key: NUTCH-867
 URL: https://issues.apache.org/jira/browse/NUTCH-867
 Project: Nutch
  Issue Type: New Feature
Affects Versions: nutchbase
Reporter: Andrzej Bialecki 
Assignee: Andrzej Bialecki 
 Fix For: nutchbase


Bring tools from NUTCH-863 to Nutchbase, and measure the performance of the 
Nutchbase branch vs. trunk.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (NUTCH-858) No longer able to set per-field boosts on lucene documents

2010-08-04 Thread Andrzej Bialecki (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-858?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12895377#action_12895377
 ] 

Andrzej Bialecki  commented on NUTCH-858:
-

It was r960064, but I have to admit I sneaked in this improvement as a part of 
NUTCH-837, which contained a lot of other stuff...

 No longer able to set per-field boosts on lucene documents
 --

 Key: NUTCH-858
 URL: https://issues.apache.org/jira/browse/NUTCH-858
 Project: Nutch
  Issue Type: Bug
  Components: indexer
Affects Versions: 1.1
 Environment: n/a
Reporter: Edward Drapkin
Assignee: Andrzej Bialecki 
 Fix For: 1.2


 I'm working on upgrading from Nutch 0.9 to Nutch 1.1 and I've noticed that it 
 no longer seems possible to set boosts on specific fields in lucene 
 documents.  This is, in my opinion, a major feature regression and removes a 
 huge component to fine tuning search.  Can this be added?

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (NUTCH-867) Port Nutch benchmark to Nutchbase

2010-08-04 Thread Andrzej Bialecki (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-867?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrzej Bialecki  updated NUTCH-867:


Attachment: benchmark.patch

Ported benchmark that uses HSQLDB as the store impl. If there are no objections 
I'll commit it shortly.

 Port Nutch benchmark to Nutchbase
 -

 Key: NUTCH-867
 URL: https://issues.apache.org/jira/browse/NUTCH-867
 Project: Nutch
  Issue Type: New Feature
Affects Versions: nutchbase
Reporter: Andrzej Bialecki 
Assignee: Andrzej Bialecki 
 Fix For: nutchbase

 Attachments: benchmark.patch


 Bring tools from NUTCH-863 to Nutchbase, and measure the performance of the 
 Nutchbase branch vs. trunk.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (NUTCH-876) Remove remaining robots/IP blocking code in lib-http

2010-08-09 Thread Andrzej Bialecki (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-876?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrzej Bialecki  updated NUTCH-876:


Attachment: NUTCH-876.patch

Patch to fix the issue. If there are no objections I'll commit this shortly.

 Remove remaining robots/IP blocking code in lib-http
 

 Key: NUTCH-876
 URL: https://issues.apache.org/jira/browse/NUTCH-876
 Project: Nutch
  Issue Type: Bug
  Components: fetcher
Affects Versions: 2.0
Reporter: Andrzej Bialecki 
Assignee: Andrzej Bialecki 
 Attachments: NUTCH-876.patch


 There are remains of the (very old) blocking code in 
 lib-http/.../HttpBase.java. This code was used with the OldFetcher to manage 
 politeness limits. New trunk doesn't have OldFetcher anymore, so this code is 
 useless. Furthermore, there is an actual bug here - FetcherJob forgets to set 
 Protocol.CHECK_BLOCKING and Protocol.CHECK_ROBOTS to false, and the defaults 
 in lib-http are set to true.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Created: (NUTCH-879) URL-s getting lost

2010-08-10 Thread Andrzej Bialecki (JIRA)
URL-s getting lost
--

 Key: NUTCH-879
 URL: https://issues.apache.org/jira/browse/NUTCH-879
 Project: Nutch
  Issue Type: Bug
Affects Versions: 2.0
 Environment: * Ubuntu 10.4 x64, Sun JDK 1.6
* using 1-node Hadoop + HDFS
* trunk r983472, using MySQL store
* branch-1.3
Reporter: Andrzej Bialecki 


I ran the Benchmark using branch-1.3 and trunk (formerly nutchbase). With the 
same Benchmark parameters and the same plugins branch-1.3 collects ~1.5mln 
urls, while trunk collects ~20,000 urls. Clearly something is wrong.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (NUTCH-880) REST API (and webapp) for Nutch

2010-08-11 Thread Andrzej Bialecki (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-880?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrzej Bialecki  updated NUTCH-880:


Description: 
This issue is for discussing a REST-style API for accessing Nutch.

Here's an initial idea:

* I propose to use org.restlet for handling requests and returning 
JSON/XML/whatever responses.
* hook up all regular tools so that they can be driven via this API. This would 
have to be an async API, since all Nutch operations take a long time to execute. 
It follows then that we need to be able also to list running operations, 
retrieve their current status, and possibly 
abort/cancel/stop/suspend/resume/...? This also means that we would have to 
potentially create & manage many threads in a servlet - AFAIK this is frowned 
upon by J2EE purists...
* package this in a webapp (that includes all deps, essentially nutch.job 
content), with the restlet servlet as an entry point.

Open issues:

* how to implement the reading of crawl results via this API
* should we manage only crawls that use a single configuration per webapp, or 
should we have a notion of crawl contexts (sets of crawl configs) with CRUD ops 
on them? this would be nice, because it would allow managing of several 
different crawls, with different configs, in a single webapp - but it 
complicates the implementation a lot.

  was:
This issue is for discussing a REST-style API for accessing Nutch.

Here's an initial idea:

* I propose to use org.restlet for handling JSON requests
* hook up all regular tools so that they can be driven via this API. This would 
have to be an async API, since all Nutch operations take a long time to execute. 
It follows that we also need to be able to list running operations, 
retrieve their current status, and possibly 
abort/cancel/stop/suspend/resume/...? This also means that we would have to 
potentially create and manage many threads in a servlet - AFAIK this is frowned 
upon by J2EE purists...
* package this in a webapp (that includes all deps, essentially nutch.job 
content), with the restlet servlet as an entry point.

Open issues:

* how to implement the reading of crawl results via this API
* should we manage only crawls that use a single configuration per webapp, or 
should we have a notion of crawl contexts (sets of crawl configs) with CRUD ops 
on them? this would be nice, because it would allow managing of several 
different crawls, with different configs, in a single webapp - but it 
complicates the implementation a lot.


 REST API (and webapp) for Nutch
 ---

 Key: NUTCH-880
 URL: https://issues.apache.org/jira/browse/NUTCH-880
 Project: Nutch
  Issue Type: New Feature
Affects Versions: 2.0
Reporter: Andrzej Bialecki 

 This issue is for discussing a REST-style API for accessing Nutch.
 Here's an initial idea:
 * I propose to use org.restlet for handling requests and returning 
 JSON/XML/whatever responses.
 * hook up all regular tools so that they can be driven via this API. This 
 would have to be an async API, since all Nutch operations take a long time to 
 execute. It follows that we also need to be able to list running 
 operations, retrieve their current status, and possibly 
 abort/cancel/stop/suspend/resume/...? This also means that we would have to 
 potentially create and manage many threads in a servlet - AFAIK this is frowned 
 upon by J2EE purists...
 * package this in a webapp (that includes all deps, essentially nutch.job 
 content), with the restlet servlet as an entry point.
 Open issues:
 * how to implement the reading of crawl results via this API
 * should we manage only crawls that use a single configuration per webapp, or 
 should we have a notion of crawl contexts (sets of crawl configs) with CRUD 
 ops on them? this would be nice, because it would allow managing of several 
 different crawls, with different configs, in a single webapp - but it 
 complicates the implementation a lot.
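
To make the restlet part more concrete, a rough sketch of what a resource could look like (this is not the attached patch; the resource name, path and port are made up, and a real implementation would report the running asynchronous jobs):

{code}
import org.restlet.Component;
import org.restlet.data.Protocol;
import org.restlet.resource.Get;
import org.restlet.resource.ServerResource;

// Hypothetical resource that lists running Nutch jobs as JSON.
public class JobListResource extends ServerResource {
  @Get("json")
  public String listJobs() {
    // Placeholder - a real implementation would serialize the running jobs.
    return "{\"jobs\":[]}";
  }

  public static void main(String[] args) throws Exception {
    Component component = new Component();
    component.getServers().add(Protocol.HTTP, 8081);
    component.getDefaultHost().attach("/api/jobs", JobListResource.class);
    component.start();
  }
}
{code}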

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Created: (NUTCH-884) FetcherJob should run more reduce tasks than default

2010-08-11 Thread Andrzej Bialecki (JIRA)
FetcherJob should run more reduce tasks than default


 Key: NUTCH-884
 URL: https://issues.apache.org/jira/browse/NUTCH-884
 Project: Nutch
  Issue Type: Improvement
  Components: fetcher
Affects Versions: 2.0
Reporter: Andrzej Bialecki 
Assignee: Andrzej Bialecki 
 Fix For: 2.0


FetcherJob now performs fetching in the reduce phase. This means that in a 
typical Hadoop setup there will be many fewer reduce tasks than map tasks, and 
consequently the max. total throughput of Fetcher will be proportionally 
reduced. I propose that FetcherJob should set the number of reduce tasks to the 
number of map tasks. This way the fetching will be more granular.
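
A minimal sketch of what the proposed change amounts to (illustrative only, not the actual patch):

{code}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

// Illustrative only: make the fetch job use as many reducers as mappers,
// so per-host fetch queues are spread over more tasks.
public class FetchJobSetup {
  public static Job createFetchJob(Configuration conf, int numMapTasks) throws Exception {
    Job job = new Job(conf, "fetch");
    job.setNumReduceTasks(numMapTasks);
    return job;
  }
}
{code}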

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Resolved: (NUTCH-872) Change the default fetcher.parse to FALSE

2010-08-11 Thread Andrzej Bialecki (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-872?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrzej Bialecki  resolved NUTCH-872.
-

Fix Version/s: 2.0
   Resolution: Fixed

I changed the name of the option to -parse to be consistent with the 
nutch-default.xml naming. I also updated the API to use this name; it's less 
confusing this way.

Committed in rev. 984401. Thanks for the feedback.

 Change the default fetcher.parse to FALSE
 -

 Key: NUTCH-872
 URL: https://issues.apache.org/jira/browse/NUTCH-872
 Project: Nutch
  Issue Type: Improvement
Affects Versions: 1.2, 2.0
Reporter: Andrzej Bialecki 
Assignee: Andrzej Bialecki 
 Fix For: 2.0


 I propose to change this property to false. The reason is that it's a safer 
 default - parsing issues don't lead to a loss of the downloaded content. For 
 larger crawls this is the recommended way to run Fetcher. Users that run 
 smaller crawls can still override it.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (NUTCH-884) FetcherJob should run more reduce tasks than default

2010-08-11 Thread Andrzej Bialecki (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-884?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrzej Bialecki  updated NUTCH-884:


Attachment: NUTCH-884.patch

Patch with the change. I also rearranged the arguments to FetcherJob.fetch(..) 
to make more sense (IMHO).

 FetcherJob should run more reduce tasks than default
 

 Key: NUTCH-884
 URL: https://issues.apache.org/jira/browse/NUTCH-884
 Project: Nutch
  Issue Type: Improvement
  Components: fetcher
Affects Versions: 2.0
Reporter: Andrzej Bialecki 
Assignee: Andrzej Bialecki 
 Fix For: 2.0

 Attachments: NUTCH-884.patch


 FetcherJob now performs fetching in the reduce phase. This means that in a 
 typical Hadoop setup there will be many fewer reduce tasks than map tasks, 
 and consequently the max. total throughput of Fetcher will be proportionally 
 reduced. I propose that FetcherJob should set the number of reduce tasks to 
 the number of map tasks. This way the fetching will be more granular.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (NUTCH-882) Design a Host table in GORA

2010-08-18 Thread Andrzej Bialecki (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-882?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12899810#action_12899810
 ] 

Andrzej Bialecki  commented on NUTCH-882:
-

This functionality is very useful for larger crawls. Some comments about the 
design:

* the table can be populated by injection, as in the patch, or from webtable. 
Since keys are from different spaces (url-s vs. hosts) I think it would be very 
tricky to try to do this on the fly in one of the existing jobs... so this 
means an additional step in the workflow.

* I'm worried about the scalability of the approach taken by HostMDApplierJob - 
per-host data will be multiplied by the number of urls from a host and put into 
webtable, which will in turn balloon the size of webtable...

A little background: what we see here is a design issue typical for mapreduce, 
where you have to merge data keyed by keys from different spaces (with 
different granularity). Possible solutions involve:
* first converting the data to a common key space and then submitting both 
datasets as mapreduce inputs, or
* submitting only the finer-grained input to mapreduce and dynamically 
converting the keys on the fly (and reading data directly from the 
coarser-grained source, accessing it randomly).

A similar situation is described in HADOOP-3063 together with a solution, 
namely, to use random access and use Bloom filters to quickly discover missing 
keys.

So I propose that instead of statically merging the data (HostMDApplierJob) we 
could merge it dynamically on the fly, by implementing a high-performance 
reader of the host table and then using this reader directly in the context of 
map()/reduce() tasks as needed. This reader should use a Bloom filter to 
quickly determine nonexistent keys, and it may use a limited amount of 
in-memory cache for existing records. The Bloom filter data should be 
re-computed on updates and stored/retrieved, to avoid lengthy initialization.

The cost of using this approach is IMHO much smaller than the cost of 
statically joining this data. The static join costs both space and time to 
execute an additional job. Let's consider the dynamic join cost, e.g. in 
Fetcher - HostDBReader would be used only when initializing host queues, so the 
number of IO-s would be at most the number of unique hosts on the fetchlist (at 
most, because some of the host data may be missing - here the Bloom filter comes 
to the rescue to quickly discover this without doing any IO). During updatedb we 
would likely want to access this data in DbUpdateReducer. Keys are URLs here, and 
they are ordered in ascending order - but they are in host-reversed format, 
which means that URLs from similar hosts and domains are close together. This 
is beneficial, because when we read data from HostDBReader we will read records 
that are close together, thus avoiding seeks. We can also cache the retrieved 
per-host data in DbUpdateReducer.
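
To illustrate the dynamic join, a rough sketch of such a reader (class and method names are made up; the real thing would read from the Gora host table):

{code}
import java.util.HashMap;
import java.util.Map;
import org.apache.hadoop.util.bloom.BloomFilter;
import org.apache.hadoop.util.bloom.Key;

// Illustrative reader: the Bloom filter is pre-computed over all host keys and
// loaded at startup; lookupInStore() stands in for a random-access store read.
public class HostDbReaderSketch {
  private final BloomFilter filter;
  private final Map<String, String> cache = new HashMap<String, String>();

  public HostDbReaderSketch(BloomFilter precomputedFilter) {
    this.filter = precomputedFilter;
  }

  /** Returns per-host metadata, doing no IO for hosts that are certainly absent. */
  public String getHostMeta(String host) {
    if (!filter.membershipTest(new Key(host.getBytes()))) {
      return null;                        // definitely not in the host table
    }
    String meta = cache.get(host);
    if (meta == null) {
      meta = lookupInStore(host);         // random access into the host table
      if (meta != null) {
        cache.put(host, meta);
      }
    }
    return meta;
  }

  private String lookupInStore(String host) {
    return null;                          // placeholder for a DataStore.get() call
  }
}
{code}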

 Design a Host table in GORA
 ---

 Key: NUTCH-882
 URL: https://issues.apache.org/jira/browse/NUTCH-882
 Project: Nutch
  Issue Type: New Feature
Affects Versions: 2.0
Reporter: Julien Nioche
Assignee: Julien Nioche
 Fix For: 2.0

 Attachments: NUTCH-882-v1.patch


 Having a separate GORA table for storing information about hosts (and 
 domains?) would be very useful for : 
 * customising the behaviour of the fetching on a host basis e.g. number of 
 threads, min time between threads etc...
 * storing stats
 * keeping metadata and possibly propagate them to the webpages 
 * keeping a copy of the robots.txt and possibly use that later to filter the 
 webtable
 * store sitemaps files and update the webtable accordingly
 I'll try to come up with a GORA schema for such a host table but any comments 
 are of course already welcome 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Created: (NUTCH-891) Nutch build should not depend on unversioned local deps

2010-08-19 Thread Andrzej Bialecki (JIRA)
Nutch build should not depend on unversioned local deps
---

 Key: NUTCH-891
 URL: https://issues.apache.org/jira/browse/NUTCH-891
 Project: Nutch
  Issue Type: Bug
Reporter: Andrzej Bialecki 


The fix in NUTCH-873 introduces an unknown variable to the build process. Since 
local ivy artifacts are unversioned, different people that install Gora jars at 
different points in time will use the same artifact id but in fact the 
artifacts (jars) will differ because they will come from different revisions of 
Gora sources. Therefore Nutch builds based on the same svn rev. won't be 
repeatable across different environments.

As much as it pains the ivy purists ;) until Gora publishes versioned artifacts 
I'd like to revert the fix in NUTCH-873 and again add Gora jars built from a 
known external rev. We can add a README that records the commit id from Gora.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (NUTCH-891) Nutch build should not depend on unversioned local deps

2010-08-19 Thread Andrzej Bialecki (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-891?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12900455#action_12900455
 ] 

Andrzej Bialecki  commented on NUTCH-891:
-

Yes, this would help.

 Nutch build should not depend on unversioned local deps
 ---

 Key: NUTCH-891
 URL: https://issues.apache.org/jira/browse/NUTCH-891
 Project: Nutch
  Issue Type: Bug
Reporter: Andrzej Bialecki 

 The fix in NUTCH-873 introduces an unknown variable to the build process. 
 Since local ivy artifacts are unversioned, different people that install Gora 
 jars at different points in time will use the same artifact id but in fact 
 the artifacts (jars) will differ because they will come from different 
 revisions of Gora sources. Therefore Nutch builds based on the same svn rev. 
 won't be repeatable across different environments.
 As much as it pains the ivy purists ;) until Gora publishes versioned 
 artifacts I'd like to revert the fix in NUTCH-873 and again add Gora jars 
 built from a known external rev. We can add a README that records the commit 
 id from Gora.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Created: (NUTCH-893) DataStore.put() silently loses records when executed from multiple processes

2010-08-25 Thread Andrzej Bialecki (JIRA)
DataStore.put() silently loses records when executed from multiple processes


 Key: NUTCH-893
 URL: https://issues.apache.org/jira/browse/NUTCH-893
 Project: Nutch
  Issue Type: Bug
 Environment: Gora HEAD, SqlStore, MySQL 5.1, Ubuntu 10.4 x64, Sun JDK 
1.6
Reporter: Andrzej Bialecki 


In order to debug the issue described in NUTCH-879 I created a test to simulate 
multiple clients appending to webtable (please see the patch), which is the 
situation that we have in distributed map-reduce jobs.

There are two tests there: one that uses multiple threads within the same JVM, 
and another that uses a single thread in multiple JVMs. Each test first clears 
webtable (be careful!), then puts a bunch of pages, and finally verifies that 
all are present and that their values correspond to their keys. To make things 
more interesting, each execution context (thread or process) closes and reopens 
its instance of DataStore a few times.

The multithreaded test passes just fine. However, the multi-process test fails 
with missing keys, as many as 30%.
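
For reference, the shape of the test is roughly as follows (a simplified sketch, not the attached patch; the KeyValueStore interface below only stands in for Gora's DataStore, so no exact Gora signatures are assumed):

{code}
// KeyValueStore stands in for DataStore<String, WebPage>.
public class MultiProcessPutSketch {

  interface KeyValueStore {
    void put(String key, String value);
    void flush();
    void close();
    String get(String key);
  }

  // Run once per process: write a disjoint key range, flushing periodically.
  static void writerProcess(KeyValueStore store, int processId, int count) {
    for (int i = 0; i < count; i++) {
      String key = "http://host" + processId + ".example.com/page" + i;
      store.put(key, key);               // value mirrors the key for easy checking
      if (i % 100 == 0) {
        store.flush();
      }
    }
    store.close();
  }

  // Run after all writers have finished: count keys that did not survive.
  static int countMissing(KeyValueStore store, int processes, int count) {
    int missing = 0;
    for (int p = 0; p < processes; p++) {
      for (int i = 0; i < count; i++) {
        String key = "http://host" + p + ".example.com/page" + i;
        if (!key.equals(store.get(key))) {
          missing++;                     // in the failing runs this reaches ~30%
        }
      }
    }
    return missing;
  }
}
{code}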

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (NUTCH-893) DataStore.put() silently loses records when executed from multiple processes

2010-08-25 Thread Andrzej Bialecki (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-893?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrzej Bialecki  updated NUTCH-893:


Attachment: NUTCH-893.patch

Unit test to illustrate the issue.

 DataStore.put() silently loses records when executed from multiple processes
 

 Key: NUTCH-893
 URL: https://issues.apache.org/jira/browse/NUTCH-893
 Project: Nutch
  Issue Type: Bug
Affects Versions: 2.0
 Environment: Gora HEAD, SqlStore, MySQL 5.1, Ubuntu 10.4 x64, Sun JDK 
 1.6
Reporter: Andrzej Bialecki 
 Attachments: NUTCH-893.patch


 In order to debug the issue described in NUTCH-879 I created a test to 
 simulate multiple clients appending to webtable (please see the patch), which 
 is the situation that we have in distributed map-reduce jobs.
 There are two tests there: one that uses multiple threads within the same 
 JVM, and another that uses a single thread in multiple JVMs. Each test first 
 clears webtable (be careful!), then puts a bunch of pages, and finally 
 verifies that all are present and that their values correspond to their keys. 
 To make things more interesting, each execution context (thread or process) 
 closes and reopens its instance of DataStore a few times.
 The multithreaded test passes just fine. However, the multi-process test 
 fails with missing keys, as many as 30%.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (NUTCH-893) DataStore.put() silently loses records when executed from multiple processes

2010-08-30 Thread Andrzej Bialecki (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-893?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12904226#action_12904226
 ] 

Andrzej Bialecki  commented on NUTCH-893:
-

Dogacan, flush() doesn't help - there are still missing keys. What's 
interesting is that the missing keys form sequential ranges. Could this be 
perhaps an issue with connection management, or some synchronization issue?

 DataStore.put() silently loses records when executed from multiple processes
 

 Key: NUTCH-893
 URL: https://issues.apache.org/jira/browse/NUTCH-893
 Project: Nutch
  Issue Type: Bug
Affects Versions: 2.0
 Environment: Gora HEAD, SqlStore, MySQL 5.1, Ubuntu 10.4 x64, Sun JDK 
 1.6
Reporter: Andrzej Bialecki 
Priority: Blocker
 Fix For: 2.0

 Attachments: NUTCH-893.patch


 In order to debug the issue described in NUTCH-879 I created a test to 
 simulate multiple clients appending to webtable (please see the patch), which 
 is the situation that we have in distributed map-reduce jobs.
 There are two tests there: one that uses multiple threads within the same 
 JVM, and another that uses a single thread in multiple JVMs. Each test first 
 clears webtable (be careful!), then puts a bunch of pages, and finally 
 verifies that all are present and that their values correspond to their keys. 
 To make things more interesting, each execution context (thread or process) 
 closes and reopens its instance of DataStore a few times.
 The multithreaded test passes just fine. However, the multi-process test 
 fails with missing keys, as many as 30%.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (NUTCH-893) DataStore.put() silently loses records when executed from multiple processes

2010-09-08 Thread Andrzej Bialecki (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-893?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12907297#action_12907297
 ] 

Andrzej Bialecki  commented on NUTCH-893:
-

Very good catch - yes, the test now passes for me too. This is actually good 
news for Gora :) I'll continue digging into NUTCH-879 ... don't hesitate to 
chime in if you have any ideas on how to solve that. I suspect we may be losing 
keys in the Generator or Fetcher due to partitioning collisions, but this 
hypothesis needs to be tested.

 DataStore.put() silently loses records when executed from multiple processes
 

 Key: NUTCH-893
 URL: https://issues.apache.org/jira/browse/NUTCH-893
 Project: Nutch
  Issue Type: Bug
Affects Versions: 2.0
 Environment: Gora HEAD, SqlStore, MySQL 5.1, Ubuntu 10.4 x64, Sun JDK 
 1.6
Reporter: Andrzej Bialecki 
Priority: Blocker
 Fix For: 2.0

 Attachments: NUTCH-893.patch, NUTCH-893_v2.patch


 In order to debug the issue described in NUTCH-879 I created a test to 
 simulate multiple clients appending to webtable (please see the patch), which 
 is the situation that we have in distributed map-reduce jobs.
 There are two tests there: one that uses multiple threads within the same 
 JVM, and another that uses a single thread in multiple JVMs. Each test first 
 clears webtable (be careful!), then puts a bunch of pages, and finally 
 verifies that all are present and that their values correspond to their keys. 
 To make things more interesting, each execution context (thread or process) 
 closes and reopens its instance of DataStore a few times.
 The multithreaded test passes just fine. However, the multi-process test 
 fails with missing keys, as many as 30%.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (NUTCH-893) DataStore.put() silently loses records when executed from multiple processes

2010-09-13 Thread Andrzej Bialecki (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-893?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12908791#action_12908791
 ] 

Andrzej Bialecki  commented on NUTCH-893:
-

+1 and +1.

 DataStore.put() silently loses records when executed from multiple processes
 

 Key: NUTCH-893
 URL: https://issues.apache.org/jira/browse/NUTCH-893
 Project: Nutch
  Issue Type: Bug
Affects Versions: 2.0
 Environment: Gora HEAD, SqlStore, MySQL 5.1, Ubuntu 10.4 x64, Sun JDK 
 1.6
Reporter: Andrzej Bialecki 
Priority: Blocker
 Fix For: 2.0

 Attachments: NUTCH-893.patch, NUTCH-893_v2.patch


 In order to debug the issue described in NUTCH-879 I created a test to 
 simulate multiple clients appending to webtable (please see the patch), which 
 is the situation that we have in distributed map-reduce jobs.
 There are two tests there: one that uses multiple threads within the same 
 JVM, and another that uses a single thread in multiple JVMs. Each test first 
 clears webtable (be careful!), then puts a bunch of pages, and finally 
 verifies that all are present and that their values correspond to their keys. 
 To make things more interesting, each execution context (thread or process) 
 closes and reopens its instance of DataStore a few times.
 The multithreaded test passes just fine. However, the multi-process test 
 fails with missing keys, as many as 30%.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Created: (NUTCH-907) DataStore API doesn't support multiple storage areas for multiple disjoint crawls

2010-09-15 Thread Andrzej Bialecki (JIRA)
DataStore API doesn't support multiple storage areas for multiple disjoint 
crawls
-

 Key: NUTCH-907
 URL: https://issues.apache.org/jira/browse/NUTCH-907
 Project: Nutch
  Issue Type: Bug
Reporter: Andrzej Bialecki 
 Fix For: 2.0


In Nutch 1.x it was possible to easily select a set of crawl data (crawldb, 
page data, linkdb, etc) by specifying a path where the data was stored. This 
enabled users to run several disjoint crawls with different configs, but still 
using the same storage medium, just under different paths.

This is not possible now because there is a 1:1 mapping between a specific 
DataStore instance and a set of crawl data.

In order to support this functionality the Gora API should be extended so that 
it can create stores (and data tables in the underlying storage) that use 
arbitrary prefixes to identify the particular crawl dataset. Then the Nutch API 
should be extended to allow passing this crawlId value to select one of 
possibly many existing crawl datasets.
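
As a trivial illustration of what such prefixes could look like on the Nutch side (a hypothetical helper, not an existing API):

{code}
// Hypothetical helper: derive a per-crawl schema/table name from a crawlId,
// so that disjoint crawls never share the same webtable.
public class CrawlDatasetNaming {
  static final String DEFAULT_SCHEMA = "webpage";

  public static String schemaNameFor(String crawlId) {
    if (crawlId == null || crawlId.length() == 0) {
      return DEFAULT_SCHEMA;                  // e.g. plain "webpage"
    }
    return crawlId + "_" + DEFAULT_SCHEMA;    // e.g. "testcrawl_webpage"
  }
}
{code}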

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (NUTCH-882) Design a Host table in GORA

2010-09-15 Thread Andrzej Bialecki (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-882?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12909757#action_12909757
 ] 

Andrzej Bialecki  commented on NUTCH-882:
-

+1 to NutchContext. See also NUTCH-907 because the changes required in Gora API 
will likely make this task easier (once implemented ;) ).

 Design a Host table in GORA
 ---

 Key: NUTCH-882
 URL: https://issues.apache.org/jira/browse/NUTCH-882
 Project: Nutch
  Issue Type: New Feature
Affects Versions: 2.0
Reporter: Julien Nioche
Assignee: Julien Nioche
 Fix For: 2.0

 Attachments: NUTCH-882-v1.patch


 Having a separate GORA table for storing information about hosts (and 
 domains?) would be very useful for : 
 * customising the behaviour of the fetching on a host basis e.g. number of 
 threads, min time between threads etc...
 * storing stats
 * keeping metadata and possibly propagate them to the webpages 
 * keeping a copy of the robots.txt and possibly use that later to filter the 
 webtable
 * store sitemaps files and update the webtable accordingly
 I'll try to come up with a GORA schema for such a host table but any comments 
 are of course already welcome 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (NUTCH-907) DataStore API doesn't support multiple storage areas for multiple disjoint crawls

2010-09-16 Thread Andrzej Bialecki (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-907?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12910109#action_12910109
 ] 

Andrzej Bialecki  commented on NUTCH-907:
-

That's very good news - in that case I'm fine with the Gora API as it is now; 
we should change Nutch to make use of this functionality.

 DataStore API doesn't support multiple storage areas for multiple disjoint 
 crawls
 -

 Key: NUTCH-907
 URL: https://issues.apache.org/jira/browse/NUTCH-907
 Project: Nutch
  Issue Type: Bug
Reporter: Andrzej Bialecki 
 Fix For: 2.0


 In Nutch 1.x it was possible to easily select a set of crawl data (crawldb, 
 page data, linkdb, etc) by specifying a path where the data was stored. This 
 enabled users to run several disjoint crawls with different configs, but 
 still using the same storage medium, just under different paths.
 This is not possible now because there is a 1:1 mapping between a specific 
 DataStore instance and a set of crawl data.
 In order to support this functionality the Gora API should be extended so 
 that it can create stores (and data tables in the underlying storage) that 
 use arbitrary prefixes to identify the particular crawl dataset. Then the 
 Nutch API should be extended to allow passing this crawlId value to select 
 one of possibly many existing crawl datasets.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (NUTCH-880) REST API (and webapp) for Nutch

2010-09-16 Thread Andrzej Bialecki (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-880?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrzej Bialecki  updated NUTCH-880:


Attachment: API.patch

Initial patch for discussion. This is a work in progress, so only some 
functionality is implemented, and even less than that is actually working ;)

I would appreciate a review and comments.

 REST API (and webapp) for Nutch
 ---

 Key: NUTCH-880
 URL: https://issues.apache.org/jira/browse/NUTCH-880
 Project: Nutch
  Issue Type: New Feature
Affects Versions: 2.0
Reporter: Andrzej Bialecki 
Assignee: Andrzej Bialecki 
 Attachments: API.patch


 This issue is for discussing a REST-style API for accessing Nutch.
 Here's an initial idea:
 * I propose to use org.restlet for handling requests and returning 
 JSON/XML/whatever responses.
 * hook up all regular tools so that they can be driven via this API. This 
 would have to be an async API, since all Nutch operations take a long time to 
 execute. It follows that we also need to be able to list running 
 operations, retrieve their current status, and possibly 
 abort/cancel/stop/suspend/resume/...? This also means that we would have to 
 potentially create and manage many threads in a servlet - AFAIK this is frowned 
 upon by J2EE purists...
 * package this in a webapp (that includes all deps, essentially nutch.job 
 content), with the restlet servlet as an entry point.
 Open issues:
 * how to implement the reading of crawl results via this API
 * should we manage only crawls that use a single configuration per webapp, or 
 should we have a notion of crawl contexts (sets of crawl configs) with CRUD 
 ops on them? this would be nice, because it would allow managing of several 
 different crawls, with different configs, in a single webapp - but it 
 complicates the implementation a lot.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Assigned: (NUTCH-862) HttpClient null pointer exception

2010-09-17 Thread Andrzej Bialecki (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-862?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrzej Bialecki  reassigned NUTCH-862:
---

Assignee: Andrzej Bialecki 

 HttpClient null pointer exception
 -

 Key: NUTCH-862
 URL: https://issues.apache.org/jira/browse/NUTCH-862
 Project: Nutch
  Issue Type: Bug
  Components: fetcher
Affects Versions: 1.0.0
 Environment: linux, java 6
Reporter: Sebastian Nagel
Assignee: Andrzej Bialecki 
Priority: Minor
 Attachments: NUTCH-862.patch


 When re-fetching a document (a continued crawl) HttpClient throws a null 
 pointer exception, causing the document to be emptied:
 2010-07-27 12:45:09,199 INFO  fetcher.Fetcher - fetching 
 http://localhost/doc/selfhtml/html/index.htm
 2010-07-27 12:45:09,203 ERROR httpclient.Http - java.lang.NullPointerException
 2010-07-27 12:45:09,204 ERROR httpclient.Http - at 
 org.apache.nutch.protocol.httpclient.HttpResponse.<init>(HttpResponse.java:138)
 2010-07-27 12:45:09,204 ERROR httpclient.Http - at 
 org.apache.nutch.protocol.httpclient.Http.getResponse(Http.java:154)
 2010-07-27 12:45:09,204 ERROR httpclient.Http - at 
 org.apache.nutch.protocol.http.api.HttpBase.getProtocolOutput(HttpBase.java:220)
 2010-07-27 12:45:09,204 ERROR httpclient.Http - at 
 org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:537)
 2010-07-27 12:45:09,204 INFO  fetcher.Fetcher - fetch of 
 http://localhost/doc/selfhtml/html/index.htm failed with: 
 java.lang.NullPointerException
 Because the document is re-fetched the server answers 304 (not modified):
 127.0.0.1 - - [27/Jul/2010:12:45:09 +0200] GET /doc/selfhtml/html/index.htm 
 HTTP/1.0 304 174 - Nutch-1.0
 No content is sent in this case (empty http body).
 Index: 
 trunk/src/plugin/protocol-httpclient/src/java/org/apache/nutch/protocol/httpclient/HttpResponse.java
 ===
 --- 
 trunk/src/plugin/protocol-httpclient/src/java/org/apache/nutch/protocol/httpclient/HttpResponse.java
 (revision 979647)
 +++ 
 trunk/src/plugin/protocol-httpclient/src/java/org/apache/nutch/protocol/httpclient/HttpResponse.java
 (working copy)
 @@ -134,7 +134,8 @@
  if (code == 200) throw new IOException(e.toString());
  // for codes other than 200 OK, we are fine with empty content
} finally {
 -in.close();
 +if (in != null)
 +  in.close();
  get.abort();
}

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Resolved: (NUTCH-906) Nutch OpenSearch sometimes raises DOMExceptions due to Lucene column names not being valid XML tag names

2010-09-17 Thread Andrzej Bialecki (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-906?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrzej Bialecki  resolved NUTCH-906.
-

Fix Version/s: 1.2
   Resolution: Fixed

Fixed in rev. 998261. Thanks!

 Nutch OpenSearch sometimes raises DOMExceptions due to Lucene column names 
 not being valid XML tag names
 

 Key: NUTCH-906
 URL: https://issues.apache.org/jira/browse/NUTCH-906
 Project: Nutch
  Issue Type: Bug
  Components: web gui
Affects Versions: 1.1
 Environment: Debian GNU/Linux 64-bit
Reporter: Asheesh Laroia
Assignee: Andrzej Bialecki 
 Fix For: 1.2

 Attachments: 
 0001-OpenSearch-If-a-Lucene-column-name-begins-with-a-num.patch

   Original Estimate: 0.33h
  Remaining Estimate: 0.33h

 The Nutch FAQ explains that OpenSearch includes all fields that are 
 available at search result time. However, some Lucene column names can start 
 with numbers. Valid XML tags cannot. If Nutch is generating OpenSearch 
 results for a document with a Lucene document column whose name starts with 
 numbers, the underlying Xerces library throws this exception: 
 org.w3c.dom.DOMException: INVALID_CHARACTER_ERR: An invalid or illegal XML 
 character is specified. 
 So I have written a patch that tests strings before they are used to generate 
 tags within OpenSearch.
 I hope you merge this, or a better version of the patch!
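
For reference, the kind of guard involved looks roughly like this (a sketch under assumed names, not the attached patch; a complete check would also validate the remaining characters of the name):

{code}
// Illustrative guard: make sure a field name can be used as an XML element
// name before the OpenSearch DOM element is created.
public class XmlTagNames {
  public static String toXmlTagName(String fieldName) {
    if (fieldName == null || fieldName.length() == 0) {
      return "field";
    }
    char first = fieldName.charAt(0);
    if (!Character.isLetter(first) && first != '_') {
      // e.g. "4xx_count" becomes "field_4xx_count"
      return "field_" + fieldName;
    }
    return fieldName;
  }
}
{code}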

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (NUTCH-909) Add alternative search-provider to Nutch site

2010-09-20 Thread Andrzej Bialecki (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-909?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12912474#action_12912474
 ] 

Andrzej Bialecki  commented on NUTCH-909:
-

bq. It might be better to see the message Search with Apache Solr (as on the 
TIKA's site).

Yes, let's make this uniform.

 Add alternative search-provider to Nutch site
 -

 Key: NUTCH-909
 URL: https://issues.apache.org/jira/browse/NUTCH-909
 Project: Nutch
  Issue Type: Improvement
  Components: documentation
Reporter: Alex Baranau
Priority: Minor
 Attachments: NUTCH-909.patch


 Add an additional search provider (besides the existing Lucid Find): search-lucene.com. 
 Initiated in discussion: http://search-lucene.com/m/2suCr1UnDfF1
 According to Andrzej's suggestion, when preparing the patch let's follow the 
 same rationales as those in TIKA-488, since they are applicable here too, so 
 please refer to that issue for more insight on implementation details.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (NUTCH-880) REST API (and webapp) for Nutch

2010-09-21 Thread Andrzej Bialecki (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-880?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12913118#action_12913118
 ] 

Andrzej Bialecki  commented on NUTCH-880:
-

bq. I think we can combine the approach you outlined in NUTCH-907 with this one.

I'm not sure... they are really not the same thing - you can execute many 
crawls with different seed lists while still using the same Configuration.

bq. What is CLASS ?

It's the same as bin/nutch fully.qualified.class.name, only here I require that 
it implements NutchTool.

bq. Btw, Andrzej, I will be happy to help out with the implementation if you 
want.

By all means - I haven't had time so far to progress beyond this patch...

 REST API (and webapp) for Nutch
 ---

 Key: NUTCH-880
 URL: https://issues.apache.org/jira/browse/NUTCH-880
 Project: Nutch
  Issue Type: New Feature
Affects Versions: 2.0
Reporter: Andrzej Bialecki 
Assignee: Andrzej Bialecki 
 Attachments: API.patch


 This issue is for discussing a REST-style API for accessing Nutch.
 Here's an initial idea:
 * I propose to use org.restlet for handling requests and returning 
 JSON/XML/whatever responses.
 * hook up all regular tools so that they can be driven via this API. This 
 would have to be an async API, since all Nutch operations take a long time to 
 execute. It follows that we also need to be able to list running 
 operations, retrieve their current status, and possibly 
 abort/cancel/stop/suspend/resume/...? This also means that we would have to 
 potentially create and manage many threads in a servlet - AFAIK this is frowned 
 upon by J2EE purists...
 * package this in a webapp (that includes all deps, essentially nutch.job 
 content), with the restlet servlet as an entry point.
 Open issues:
 * how to implement the reading of crawl results via this API
 * should we manage only crawls that use a single configuration per webapp, or 
 should we have a notion of crawl contexts (sets of crawl configs) with CRUD 
 ops on them? this would be nice, because it would allow managing of several 
 different crawls, with different configs, in a single webapp - but it 
 complicates the implementation a lot.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (NUTCH-907) DataStore API doesn't support multiple storage areas for multiple disjoint crawls

2010-10-01 Thread Andrzej Bialecki (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-907?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12916870#action_12916870
 ] 

Andrzej Bialecki  commented on NUTCH-907:
-

Hi Sertan,

Thanks for the patch, this looks very good! A few comments:

* I'm not good at naming things either... schemaId is a little bit cryptic 
though. If we didn't already use crawlId I would vote for that (and then rename 
crawlId to batchId or fetchId). As it is now... I don't know, maybe datasetId...

* since we now create multiple datasets, we somehow need to manage them - i.e. 
list and delete them at least (create is implicit). There is no such functionality 
in this patch, but this can also be addressed as a separate issue.

* IndexerMapReduce.createIndexJob: I think it would be useful to pass the 
datasetId as a Job property - this way indexing filter plugins can use this 
property to populate NutchDocument fields if needed. FWIW, this may be a good 
idea to do in other jobs as well...
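
A tiny sketch of that last point (the property key "storage.crawl.id" is only an example):

{code}
import org.apache.hadoop.conf.Configuration;

// Example only: the job sets the dataset id once, and an indexing filter can
// later read it back and put it into a NutchDocument field.
public class DatasetIdPropagation {
  public static final String DATASET_ID_KEY = "storage.crawl.id";

  public static void storeOnJob(Configuration conf, String datasetId) {
    if (datasetId != null) {
      conf.set(DATASET_ID_KEY, datasetId);
    }
  }

  public static String readInFilter(Configuration conf) {
    return conf.get(DATASET_ID_KEY, "default");
  }
}
{code}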

 DataStore API doesn't support multiple storage areas for multiple disjoint 
 crawls
 -

 Key: NUTCH-907
 URL: https://issues.apache.org/jira/browse/NUTCH-907
 Project: Nutch
  Issue Type: Bug
Reporter: Andrzej Bialecki 
 Fix For: 2.0

 Attachments: NUTCH-907.patch


 In Nutch 1.x it was possible to easily select a set of crawl data (crawldb, 
 page data, linkdb, etc) by specifying a path where the data was stored. This 
 enabled users to run several disjoint crawls with different configs, but 
 still using the same storage medium, just under different paths.
 This is not possible now because there is a 1:1 mapping between a specific 
 DataStore instance and a set of crawl data.
 In order to support this functionality the Gora API should be extended so 
 that it can create stores (and data tables in the underlying storage) that 
 use arbitrary prefixes to identify the particular crawl dataset. Then the 
 Nutch API should be extended to allow passing this crawlId value to select 
 one of possibly many existing crawl datasets.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (NUTCH-882) Design a Host table in GORA

2010-10-01 Thread Andrzej Bialecki (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-882?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12916874#action_12916874
 ] 

Andrzej Bialecki  commented on NUTCH-882:
-

Doğacan, I missed your previous comment... the issue with partial bloom filters 
is usually solved by having each task store its own filter - this worked well for 
MapFile-s because they consisted of multiple parts, so a Reader would open 
a part and its corresponding bloom filter.

Here it's more complicated, I agree... though this reminds me of the situation 
that is handled by DynamicBloomFilter: it's basically a set of Bloom filters 
with a facade that hides this fact from the user. Here we could construct 
something similar, i.e. don't merge partial filters after closing the output, 
but instead when opening a Reader read all partial filters and pretend they are 
one.
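
Something along these lines (a sketch; loading and serializing the per-task filters is left out):

{code}
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.util.bloom.BloomFilter;
import org.apache.hadoop.util.bloom.Key;

// Facade over per-task filters: each task wrote its own filter, the reader
// loads all of them, and a key is considered present if any part matches.
public class PartialBloomFacade {
  private final List<BloomFilter> parts = new ArrayList<BloomFilter>();

  public void addPart(BloomFilter part) {   // one per map/reduce task output
    parts.add(part);
  }

  /** Same contract as a single filter: false means the key is definitely absent. */
  public boolean membershipTest(Key key) {
    for (BloomFilter part : parts) {
      if (part.membershipTest(key)) {
        return true;
      }
    }
    return false;
  }
}
{code}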

 Design a Host table in GORA
 ---

 Key: NUTCH-882
 URL: https://issues.apache.org/jira/browse/NUTCH-882
 Project: Nutch
  Issue Type: New Feature
Affects Versions: 2.0
Reporter: Julien Nioche
Assignee: Julien Nioche
 Fix For: 2.0

 Attachments: hostdb.patch, NUTCH-882-v1.patch


 Having a separate GORA table for storing information about hosts (and 
 domains?) would be very useful for : 
 * customising the behaviour of the fetching on a host basis e.g. number of 
 threads, min time between threads etc...
 * storing stats
 * keeping metadata and possibly propagate them to the webpages 
 * keeping a copy of the robots.txt and possibly use that later to filter the 
 webtable
 * store sitemaps files and update the webtable accordingly
 I'll try to come up with a GORA schema for such a host table but any comments 
 are of course already welcome 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (NUTCH-864) Fetcher generates entries with status 0

2010-10-01 Thread Andrzej Bialecki (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-864?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12916912#action_12916912
 ] 

Andrzej Bialecki  commented on NUTCH-864:
-

I think the difficulty comes from the simplification in 2.x as compared to 1.x, 
in that we keep a single status per page. In 1.x a side-effect of having two 
locations with two statuses (one db status in crawldb and one fetch status 
in segments) was that we had more information in updatedb to act upon.

Now we should probably keep up to two statuses - one that reflects a temporary 
fetch status, as determined by the fetcher, and a final (reconciled) status as 
determined by updatedb, based on knowledge of not only the plain fetch status 
and the old status but also possible redirects. If I'm not mistaken, currently 
the status is immediately overwritten by the fetcher, even before we get to 
updatedb, hence the problem...
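
To make the two-status idea concrete, a rough sketch (constants and names are made up; this is not the webtable schema):

{code}
// The fetcher records only a provisional fetch status; updatedb derives the
// final db status from the previous status, the fetch status and any redirects.
public class StatusReconciliationSketch {
  static final int DB_UNFETCHED = 1, DB_FETCHED = 2, DB_GONE = 3, DB_REDIRECT = 4;
  static final int FETCH_SUCCESS = 1, FETCH_GONE = 2, FETCH_REDIRECT = 3;

  /** Called from the updatedb reducer, never from the fetcher itself. */
  static int reconcile(int previousDbStatus, Integer provisionalFetchStatus) {
    if (provisionalFetchStatus == null) {
      return previousDbStatus;            // page was not fetched in this round
    }
    switch (provisionalFetchStatus.intValue()) {
      case FETCH_SUCCESS:  return DB_FETCHED;
      case FETCH_REDIRECT: return DB_REDIRECT;
      case FETCH_GONE:     return DB_GONE;
      default:             return previousDbStatus;
    }
  }
}
{code}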

 Fetcher generates entries with status 0
 ---

 Key: NUTCH-864
 URL: https://issues.apache.org/jira/browse/NUTCH-864
 Project: Nutch
  Issue Type: Bug
  Components: fetcher
 Environment: Gora with SQLBackend
 URL: https://svn.apache.org/repos/asf/nutch/branches/nutchbase
 Last Changed Rev: 980748
 Last Changed Date: 2010-07-30 14:19:52 +0200 (Fri, 30 Jul 2010)
Reporter: Julien Nioche
Assignee: Doğacan Güney
 Fix For: 2.0


 After a round of fetching which got the following protocol status :
 10/07/30 15:11:39 INFO mapred.JobClient: ACCESS_DENIED=2
 10/07/30 15:11:39 INFO mapred.JobClient: SUCCESS=1177
 10/07/30 15:11:39 INFO mapred.JobClient: GONE=3
 10/07/30 15:11:39 INFO mapred.JobClient: TEMP_MOVED=138
 10/07/30 15:11:39 INFO mapred.JobClient: EXCEPTION=93
 10/07/30 15:11:39 INFO mapred.JobClient: MOVED=521
 10/07/30 15:11:39 INFO mapred.JobClient: NOTFOUND=62
 I ran : ./nutch org.apache.nutch.crawl.WebTableReader -stats
 10/07/30 15:12:37 INFO crawl.WebTableReader: Statistics for WebTable: 
 10/07/30 15:12:37 INFO crawl.WebTableReader: TOTAL urls:  2690
 10/07/30 15:12:37 INFO crawl.WebTableReader: retry 0: 2690
 10/07/30 15:12:37 INFO crawl.WebTableReader: min score:   0.0
 10/07/30 15:12:37 INFO crawl.WebTableReader: avg score:   0.7587361
 10/07/30 15:12:37 INFO crawl.WebTableReader: max score:   1.0
 10/07/30 15:12:37 INFO crawl.WebTableReader: status 0 (null): 649
 10/07/30 15:12:37 INFO crawl.WebTableReader: status 2 (status_fetched):   
 1177 (SUCCESS=1177)
 10/07/30 15:12:37 INFO crawl.WebTableReader: status 3 (status_gone):  112 
 10/07/30 15:12:37 INFO crawl.WebTableReader: status 34 (status_retry):
 93 (EXCEPTION=93)
 10/07/30 15:12:37 INFO crawl.WebTableReader: status 4 (status_redir_temp):
 138  (TEMP_MOVED=138)
 10/07/30 15:12:37 INFO crawl.WebTableReader: status 5 (status_redir_perm):
 521 (MOVED=521)
 10/07/30 15:12:37 INFO crawl.WebTableReader: WebTable statistics: done
 There should not be any entries with status 0 (null)
 I will investigate a bit more...

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (NUTCH-913) Nutch should use new namespace for Gora

2010-10-13 Thread Andrzej Bialecki (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-913?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12920610#action_12920610
 ] 

Andrzej Bialecki  commented on NUTCH-913:
-

There are formatting issues in DomainStatistics.java - the file uses literal 
tabs, which we frown upon, but the patch introduces double-space indent in the 
changed lines. As ugly as it sounds I think this should be changed into tabs, 
and then reformatted in another commit.

Other than that, +1, go for it.

 Nutch should use new namespace for Gora
 ---

 Key: NUTCH-913
 URL: https://issues.apache.org/jira/browse/NUTCH-913
 Project: Nutch
  Issue Type: Bug
  Components: storage
Reporter: Doğacan Güney
Assignee: Doğacan Güney
 Fix For: 2.0

 Attachments: NUTCH-913_v1.patch


 Gora is in Apache Incubator now (Yey!). We recently changed Gora's namespace 
 from org.gora to org.apache.gora. This means Nutch should use the new 
 namespace, otherwise it won't compile with newer builds of Gora.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (NUTCH-921) Reduce dependency of Nutch on config files

2010-10-19 Thread Andrzej Bialecki (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-921?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrzej Bialecki  updated NUTCH-921:


Attachment: NUTCH-921.patch

Patch that implements reading config parameters from Configuration, and falls 
back to config files if Configuration properties are unspecified.
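
The general pattern looks like this (a simplified sketch; the property and file names are only examples, not necessarily the ones used in the patch):

{code}
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import org.apache.hadoop.conf.Configuration;

// Inline configuration wins; the classpath file is only the fallback.
public class ConfigWithFileFallback {
  public static InputStream openRules(Configuration conf) {
    String inlineRules = conf.get("urlfilter.regex.rules");
    if (inlineRules != null) {
      return new ByteArrayInputStream(inlineRules.getBytes());
    }
    String fileName = conf.get("urlfilter.regex.file", "regex-urlfilter.txt");
    return ConfigWithFileFallback.class.getClassLoader().getResourceAsStream(fileName);
  }
}
{code}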

 Reduce dependency of Nutch on config files
 --

 Key: NUTCH-921
 URL: https://issues.apache.org/jira/browse/NUTCH-921
 Project: Nutch
  Issue Type: Improvement
Affects Versions: 2.0
Reporter: Andrzej Bialecki 
Assignee: Andrzej Bialecki 
 Fix For: 2.0

 Attachments: NUTCH-921.patch


 Currently many components in Nutch rely on reading their configuration from 
 files. These files need to be on the classpath (or packed into a job jar). 
 This is inconvenient if you want to manage configuration via API, e.g. when 
 embedding Nutch, or running many jobs with slightly different configurations.
 This issue tracks the improvement to make various components read their 
 config directly from Configuration properties.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


