Re: Nutch 2.0
On 2010-06-28 07:49, Sami Siren wrote:
> One aspect that has not been discussed yet is the legal aspect. According to http://incubator.apache.org/ip-clearance/index.html there is a formal process for integrating development efforts that have happened outside of Apache. Should we be following the IP clearance process in this case too?

The concept of a substantial contribution that should be subject to a software grant is somewhat tenuous, though. Keep in mind that you do something equivalent in JIRA already - when you check the "Grant license to ASF" box you perform a micro-grant. So the question is whether we should go through a full grant or through the JIRA micro-grant.

In my opinion it's OK to do the latter, since much of the code is simply a modified version of Nutch classes - not counting GORA, of course, but that part will be added as a third-party lib. So IMHO it's enough to zip all the source (without libs), attach it to a JIRA issue and mark the checkbox. Then we follow the process outlined by Chris, which imports the same codebase into our svn. What do you think?

If folks agree that this is sufficient, then Dogacan, Enis - can you please create a separate JIRA issue, prepare a patch like this, mark the checkbox, and list all dependencies and their licenses for those that are not already in Nutch svn?

--
Best regards,
Andrzej Bialecki
Information Retrieval, Semantic Web | Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com
Re: Where is nutch 2.0
On 2010-06-29 11:17, Raghavendra Neelekani wrote:
> Hi, can you please tell me where I can download Nutch 2.0?

Nutch 2.0 is in the planning and early development phase, so it can't be downloaded yet. We hope to produce a working Nutch 2.0 some time in Q4 2010.

--
Best regards,
Andrzej Bialecki
[Nutchbase] WebPage class is generated code?
Hi,

(This question is mostly to Dogacan and Enis, but I encourage anyone familiar with the code to join the threads marked with [Nutchbase] - the sooner the better ;) )

I'm looking at src/gora/webpage.avsc and WebPage.java and friends... presumably the Java code was autogenerated from the .avsc using Gora? If so, we should put this autogeneration step in our build.xml. Or am I missing something?

--
Best regards,
Andrzej Bialecki
Re: Minimizing the number of stored fields for Solr
On 2010-07-03 10:00, Doğacan Güney wrote:
> Hey everyone,
> This is not really a proposition but rather something I have been wondering about for a while, so I wanted to see what everyone is thinking. Currently in our Solr backend, we have stored=true indexed=false fields and stored=true indexed=true fields. The former class of fields is mostly used for storing digest, caching information etc. I suggest that we get rid of all indexed=false fields and read all such data from the storage backend. For the latter class of fields (i.e., stored=true indexed=true), I suggest that we set them to stored=false for everything but the id field. As an example, currently title is stored/indexed in Solr while text is only indexed (thus, it will need to be fetched from the storage backend). But for the HBase backend, title and text are already stored close together (in the same column family), so the performance hit of reading just text or reading both will likely be the same. And removing storage from Solr may lead to better caching of indexed fields and better performance. What does everyone think?

The issue is not as simple as it looks. If you want good performance for search snippet generation then you still need to keep some data in stored fields - at least url, title, and plain text (not to mention the option to use term vectors in order to speed up snippet generation). Solr functionality can also be impaired by a lack of data available directly from Lucene storage (field cache, faceting, term vector highlighting).

Some fields of course are not useful for display, but are used for searching only (e.g. anchors). These should be indexed but not stored in Solr. And it's OK to get them from non-Solr storage if requested, because that's a rare event.

The same goes for the full raw content, if you want to offer a cached view - this should not be stored in Solr but instead should come from a separate layer (note that sometimes the cached view might not be in the original format - PDF, Office, etc. - and instead an HTML representation may be more suitable, so in general the cached view shouldn't automatically equal the original raw content).

But for other fields I would argue that for now they should remain stored in Solr, *even the full text*, until we figure out how they affect the functionality and performance of common search operations. E.g. if we remove the stored title field then we need to reach into the storage layer in order to display each page of results... not to mention issues like highlighting, faceting, function queries and a host of other functionalities that Solr can offer just because a field is stored in its index.

So I'm -0 on this proposal - of course we should review our schema, and of course we should have a mechanism to get data from the storage layer, but what you propose is IMHO premature optimization at this point.

--
Best regards,
Andrzej Bialecki
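To make the tradeoff discussed above concrete, here is a hypothetical slice of a Solr schema.xml along the lines argued for. Field names follow the discussion, but the types and attribute choices are illustrative, not Nutch's actual schema:

```xml
<!-- Illustrative sketch, not the actual Nutch schema. -->
<!-- Needed for display and highlighting: keep stored in Solr. -->
<field name="id"      type="string" indexed="true" stored="true"/>
<field name="url"     type="string" indexed="true" stored="true"/>
<field name="title"   type="text"   indexed="true" stored="true"/>
<!-- termVectors can speed up snippet generation on the full text. -->
<field name="content" type="text"   indexed="true" stored="true" termVectors="true"/>
<!-- Search-only: indexed but not stored; fetched from the storage
     backend (e.g. HBase) in the rare case the raw value is needed. -->
<field name="anchor"  type="text"   indexed="true" stored="false" multiValued="true"/>
<!-- Pure bookkeeping (digest, caching info) would be candidates to
     drop from Solr entirely and read from the backend instead. -->
```

The dividing line in this sketch is whether a field participates in result rendering (stored) or only in matching (indexed only).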
YCSB benchmark for KV stores
Hi,

Found this link: http://wiki.github.com/brianfrankcooper/YCSB/papers-and-presentations

It would be cool to run the benchmark on the same stores, but via Gora.

--
Best regards,
Andrzej Bialecki
Re: Parse-tika ignores too much data...
On 2010-07-07 22:32, Ken Krugler wrote:
> Hi Julien,
> See https://issues.apache.org/jira/browse/TIKA-457 for a description of one of the cases found by Andrzej.
>
>> There seems to be something very wrong with the way body is handled; we also saw cases where it appeared twice in the output.
>
> Don't know about the case of it appearing twice. But for the above issue, I added a comment. The test HTML is badly broken, in that you can either have a body OR a frameset, but not both.

The HTML was broken on purpose - one of the goals of the original test was to extract as much content and as many links as possible in the presence of grave errors. As you know, even major sites often produce badly broken HTML, but the parser should sanitize it and produce a valid DOM. In this case, it produced two nested body elements, which is not valid. I should also mention that NekoHTML handled this test much better, by removing the body and retaining only the frameset.

--
Best regards,
Andrzej Bialecki
Re: Merging in nutchbase
On 2010-07-10 15:00, Doğacan Güney wrote:
> Hey everyone,
> I would like to start merging nutchbase into trunk, so I am hoping to get everyone's comments and suggestions on how to do that.

Do we have any way to run the merged code without running HBase? I think that the SQL backend to Gora needs to be tested first with the nutchbase branch - otherwise development and testing will become very difficult... So in my opinion we need to make sure we can use a small SQL backend (Derby or HSQL) before we start merging.

As for the mechanics of the patching - yes, I think it needs to be done this way.

--
Best regards,
Andrzej Bialecki
Re: Merging in nutchbase
On 2010-07-10 17:01, Doğacan Güney wrote:
> On Sat, Jul 10, 2010 at 17:43, Mattmann, Chris A (388J) chris.a.mattm...@jpl.nasa.gov wrote:
>> Hey Guys,
>> +1 to Andrzej's suggestion. I mostly run small scale stuff with Nutch, so unless I can run HBase at small scale (or better yet, an embedded SQL db), I won't be as much use! :)
>
> I just want to make clear that this is, indeed, a goal I share. Gora already has an SQL backend that can use embedded hsqldb. However, there are some weird bugs (I really hate SQL :), but once I am done fixing all the bugs (which I will be doing today and tomorrow), nutch will run on gora (embedded hsqldb) with zero configuration.

Excellent, that would be a real breakthrough.

--
Best regards,
Andrzej Bialecki
Re: Component fetching during parsing. (vertical crawling)
On 2010-07-20 14:30, Ferdy wrote:
> Hello,
> We are currently using a heavily modified version of Nutch. The main reason for this is that we do not only fetch the urls that the QueueFeeder submits, but also additional resources from urls that are constructed during parsing. So for example, let's say the QueueFeeder submits an html page to the fetcher, and after the fetch the page gets parsed. Nothing special so far. However, the parser decides it also needs some images on the page. Perhaps these images link to other html pages, and we might want to fetch these too. All this is needed to parse information about the particular url we started with. These extra fetch urls we like to call Components, because they are additional resources required to do the parsing of our initial html page that was selected for fetching.
>
> At first we tried to solve this vertical crawling problem by using multiple crawl cycles. Each crawl simply selects outlinks that are needed for the parsing of the initial html page. A single inspection can possibly span 2, 3 or 4 cycles (depending on the inspection's graph depth). There are several problems with this approach: for one, the crawldb is cluttered with all these component urls, and secondly, inspection completion times can be very long. As an alternative we decided to let the parser fetch the needed components on the fly, so that additional urls are instantly added to the fetcher lists. Every fetched url can be either a non-component (the QueueFeeder fed it; start parsing this resource) or a component (the fetcher hands the resource over to the parser that requested it). In order to keep parsers alive we always try to fetch components first, with respect to fetch politeness. A downside of this solution is that your fetch task's total running time will be more difficult to anticipate. For example, if you inject and generate 100 urls and they are fetched in a single task, you might end up fetching a total of 1100 urls (on the assumption that each inspection needs 10 components). We found this behaviour to be acceptable.
>
> Because of our custom version of Nutch we cannot upgrade easily to newer versions (we're still using modified fetcher classes from Nutch 0.9). Often we end up fixing bugs that have already been fixed by the community. Also, other users might benefit from our changes too. Therefore we propose to redesign our vertical crawling system from scratch for the newer Nutch versions, should there be any interest from the community. Perhaps we are not the only ones to implement such a system with Nutch. So, what are your thoughts about this?

If I understand your use case properly, this is really a custom Fetcher that you are talking about - a strategy to fetch complete pages (together with the resources needed to display the page) should be possible to implement in a custom fetcher without changing other Nutch areas.

--
Best regards,
Andrzej Bialecki
Re: Nutchbase merge strategy
On 2010-07-21 21:12, Mattmann, Chris A (388J) wrote:
> Hey Andrzej,
> +1 to all of the above - see below.
>
>>> So if 1-4 make sense, let's do 1, 2 and 3 today or tomorrow -- 4 can happen over the next few weeks. WDYT?
>>
>> This is a serious move - let's wait a bit, say until Monday, to give a chance to others to comment.
>
> Agreed. Let's wait until Monday. If there aren't any objections, let's let er' rip! BTW, #4 is independent of #1-3. WDYT about wrapping up the 1.x series of Nutch and rolling a 1.2 in the next few days (while I have some free cycles)? :) #4 is also in its own branch and therefore independent as well, so it won't be as brave a move. Let me know what you (all) think.

If 1.2 is going to be the last release in the 1.x series then I think we should review some pending issues, especially those reported after the 1.0 release:

https://issues.apache.org/jira/secure/IssueNavigator.jspa?reset=true&pid=10680&updated%3Aprevious=-1&created%3Aafter=1%2FApr%2F09&status=1&status=3&status=4&sorter/field=updated&sorter/order=DESC

Actually, just two issues are still unresolved... hmm, not bad.

--
Best regards,
Andrzej Bialecki
[Nutchbase] Multi-value ParseResult missing
Hi,

I noticed that nutchbase doesn't use the multi-valued ParseResult; instead all parse plugins return a simple Parse. As a consequence, it's not possible to return multiple values from parsing a single WebPage, something that parsers for compound documents absolutely require (archives, rss, mbox, etc.). Dogacan - was there a particular reason for this change?

However, a broader issue here is how to treat compound documents, and links to/from them:

a) record all URLs of child documents (e.g. with the !/ notation, or # notation), and create as many WebPage-s as there were archive members. This needs some hacks to prevent such urls from being scheduled for fetching.

b) extend WebPage to allow for multiple content sections and their names (and metadata, and ... yuck).

c) like a), except put a special synthetic mark on the page to prevent selection of this page for generation and fetching. This mark would also help us to update / remove obsolete sub-documents when their container changes.

I'm leaning towards c).

Now, when it comes to the ParseResult... it's not an ideal solution either, because it means we have to keep all sub-document results in memory. We could avoid that by implementing something that Aperture uses, which is a "sub-crawler" - a concept of a parser plugin for compound formats. The main plugin would return a special result code, which basically says "this is a compound format of type X", and then the caller (ParseUtil?) would use SubCrawlerFactory.create(typeX, containerDataStream) to create a parser for the container. This parser in turn would simply extract sections of the compound document (as streams) and pass each stream to the regular parsing chain. The caller then needs to iterate over the results returned from the SubCrawler. What do you think?

--
Best regards,
Andrzej Bialecki
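To make the sub-crawler idea concrete, a minimal sketch in plain Java. All names here (SubCrawler, the parse-chain callback) are hypothetical illustrations of the concept, not actual Nutch or Aperture APIs:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Consumer;

public class SubCrawlerSketch {

    // Hypothetical: a parser for a compound format that, instead of
    // building one big in-memory ParseResult, streams each member
    // document to the regular parsing chain, one at a time.
    interface SubCrawler {
        void crawl(String containerUrl, List<String> members, Consumer<String> parseChain);
    }

    static class SimpleSubCrawler implements SubCrawler {
        public void crawl(String containerUrl, List<String> members, Consumer<String> parseChain) {
            for (String member : members) {
                // The "!/" notation from option a) marks a sub-document
                // of its container; only one member is held at a time.
                parseChain.accept(containerUrl + "!/" + member);
            }
        }
    }

    // Expand a container into the sub-document URLs seen by the chain.
    static List<String> expand(String containerUrl, List<String> members) {
        List<String> parsed = new ArrayList<>();
        new SimpleSubCrawler().crawl(containerUrl, members, parsed::add);
        return parsed;
    }

    public static void main(String[] args) {
        System.out.println(expand("http://example.com/a.zip",
                Arrays.asList("doc1.html", "doc2.html")));
    }
}
```

The point of the callback shape is that the caller iterates, so memory use stays bounded by one member document rather than the whole archive.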
Benchmark of Nutch trunk
Hi,

We have a simple crawling benchmark now in trunk. Here's how to use it:

* in one console execute 'ant proxy'. This will start, on port 8181, a proxy server that produces fake pages.
* in another console execute 'ant benchmark'. This will run 5 rounds of fetching (~16,000 pages) using that proxy server.

There are already some interesting issues I noticed. First, on reasonably good hardware in local mode I was able to fetch and process (NOTE: this includes ALL steps, i.e. generate, fetch, parse, crawldb update and invertlinks) 16k pages in 400 sec. This means a total crawling throughput of 40 pages/sec. This is in local mode, so in distributed mode I guess we would be getting this number times the number of tasks.

Secondly, it seems that Fetcher has some synchronization issues in its queue management - even if other queues are non-empty, when one of the queues blocks, the Fetcher will spin-wait all threads until an item becomes available on that queue, and then it starts to happily consume items from all non-blocking queues (including this one). The process then repeats - one queue blocks, and all threads stop getting items from the other queues... At the moment I can't figure out where this lock-up is happening, but the symptoms are obvious when you look at the logs in real time.

More stuff to come on this subject - at least we have a tool to experiment with :)

--
Best regards,
Andrzej Bialecki
Re: Seeking Insight into Nutch Configurations
On 2010-08-02 10:17, Scott Gonyea wrote:
> The big problem that I am facing, thus far, occurs on the 4th fetch. All but 1 or 2 maps complete. All of the running reduces stall (0.00 MB/s), presumably because they are waiting on that map to finish? I really don't know and it's frustrating.

Yes, all map tasks need to finish before reduce tasks are able to proceed. The reason is that each reduce task receives a portion of the keyspace (and values) according to the Partitioner, and in order to prepare a nice (key, list(values)) pair in your reducer it needs to, well, get all the values under this key first, whichever map task produced the tuples, and then sort them.

The failing tasks probably fail due to some other factor, and very likely (based on my experience) the failure is related to some particular URLs. E.g. regex URL filtering can choke on some pathological URLs, like URLs 20kB long, or containing '\0', etc. In my experience it's best to keep regex filtering to a minimum if you can, and use other urlfilters (prefix, domain, suffix, custom) to limit your crawling frontier. There are simply too many ways in which a regex engine can lock up. Please check the logs of the failing tasks. If you see that a task is stalled you could also log in to the node and generate a thread dump a few times in a row (kill -SIGQUIT pid) - if each thread dump shows the regex processing then it's likely this is your problem.

> My scenario:
> Sites: 10,000-30,000 per crawl
> Depth: ~5
> Content: Text is all that I care for. (HTML/RSS/XML)
> Nodes: Amazon EC2 (ugh)
> Storage: I've performed crawls with HDFS and with Amazon S3. I thought S3 would be more performant, yet it doesn't appear to affect matters.
> Cost vs Speed: I don't mind throwing EC2 instances at this to get it done quickly... but I can't imagine I need much more than 10-20 mid-size instances for this.

That's correct - with this number of unique sites the max throughput of your crawl will ultimately be limited by the politeness limits (# of requests/site/sec).

> Can anyone share their own experiences in the performance they've seen?

There is a very simple benchmark in trunk/ that you could use to measure the raw performance (data processing throughput) of your EC2 cluster. The real-life performance, though, will depend on many other factors, such as the number of unique sites, their individual speed, and (rarely) the total bandwidth at your end.

--
Best regards,
Andrzej Bialecki
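The remark about regex engines locking up can be illustrated: Java's backtracking regex engine takes exponential time on patterns with nested quantifiers when the input does not match. This is a hedged sketch - the pattern and input sizes are illustrative, not taken from Nutch's actual regex-urlfilter rules:

```java
import java.util.regex.Pattern;

public class RegexBlowupDemo {

    // A nested quantifier like (a+)+ forces the engine to try every
    // way of splitting the run of 'a's once the trailing 'b' turns
    // out to be missing - roughly 2^n backtracking states for n chars.
    static final Pattern EVIL = Pattern.compile("(a+)+b");

    // Time a non-matching attempt for a run of n 'a's ending in 'X'.
    static long timeNonMatchNanos(int n) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < n; i++) sb.append('a');
        sb.append('X'); // guarantees a mismatch, forcing full backtracking
        long t0 = System.nanoTime();
        boolean matched = EVIL.matcher(sb).matches();
        return matched ? -1 : System.nanoTime() - t0;
    }

    public static void main(String[] args) {
        // Sizes are kept small so the demo stays fast; a 20kB
        // pathological URL against a complex filter may effectively
        // never return, which is what a stalled task's thread dumps
        // would show.
        for (int n = 10; n <= 22; n += 4) {
            System.out.println(n + " chars -> " + timeNonMatchNanos(n) + " ns");
        }
    }
}
```

Each additional batch of 4 characters multiplies the running time by roughly 16, which is why the failure looks like a hang rather than an error.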
Re: Seeking Insight into Nutch Configurations
On 2010-08-02 22:59, Scott Gonyea wrote:
> By the way, can anyone tell me if there is a way to explicitly limit how many pages should be fetched, per fetcher-task?

I believe that in the general case it would be a very complex problem to solve so that you get exact results. The reason is that Nutch doesn't use any global lock manager, so the only way to ensure proper per-host locking is to assign all URL-s from any given host to the same map task. This may (and often will) create an imbalance in the number of allocated URL-s per task. One method to mitigate this imbalance is to set generate.max.count (in trunk; generate.max.per.host in 1.1) - this will limit the number of URL-s from any given host to X, thus helping in a more balanced mixing of these N per-host chunks across M maps.

> I think part of the problem is that, seemingly, Nutch seems to be generating some really unbalanced fetcher tasks. The task (task_201008021617_0026_m_00) had 6859 pages to fetch. Each higher-numbered task had fewer pages to fetch. Task 000180 only had 44 pages to fetch.

There's no specific tool to examine the composition of fetchlist parts... try running this in segments/2010*/crawl_generate/:

  for i in part-00*; do
    echo "part $i:"
    strings $i | grep http://
  done

to print the URL-s per map task. Most likely you will see that there was no other way to allocate the URLs per task to satisfy the constraint that I explained above. If that's not the case, then it's a bug. :)

> This *huge* imbalance, I think, creates tasks that are seemingly unpredictable. All of my other resources just sit around, wasting resources, until one task grabs some crazy number of sites.

Again, generate.max.count is your friend - even though you won't be able to get all pages from a big site in one go, at least your crawls will finish quickly and you will quickly progress breadth-wise, if not depth-wise.

--
Best regards,
Andrzej Bialecki
[Nutchbase] jmxtools issue...
Hi,

I can't compile nutchbase at the moment - Ivy has trouble finding jmxri.jar and jmxtools.jar... I found jmxri.jar somewhere and put it in my .ivy2/local, but I can't find jmxtools.jar... Anyway, why do we need these two jars at all???

--
Best regards,
Andrzej Bialecki
Hsqldb 2.0 conflicts with Hsqldb 1.8 in Hadoop
Hi,

I was trying to run Benchmark in trunk using MySQL, on a standalone Hadoop cluster. My conf/gora.properties has this:

  gora.sqlstore.jdbc.driver=com.mysql.jdbc.Driver
  gora.sqlstore.jdbc.url=jdbc:mysql://localhost:3306/nutch?user=nutch&password=nutch

Jobs were failing though, with the following:

Exception in thread "main" java.lang.NoSuchMethodError: org.hsqldb.DatabaseURL.parseURL(Ljava/lang/String;ZZ)Lorg/hsqldb/persist/HsqlProperties;
        at org.hsqldb.jdbc.JDBCDriver.getConnection(Unknown Source)
        at org.hsqldb.jdbc.JDBCDriver.connect(Unknown Source)
        at java.sql.DriverManager.getConnection(DriverManager.java:582)
        at java.sql.DriverManager.getConnection(DriverManager.java:207)
        at org.gora.sql.store.SqlStore.getConnection(SqlStore.java:712)
        at org.gora.sql.store.SqlStore.initialize(SqlStore.java:145)
        at org.gora.store.DataStoreFactory.initializeDataStore(DataStoreFactory.java:64)
        at org.gora.store.DataStoreFactory.createDataStore(DataStoreFactory.java:86)
        at org.gora.store.DataStoreFactory.createDataStore(DataStoreFactory.java:98)
        at org.gora.store.DataStoreFactory.createDataStore(DataStoreFactory.java:70)
        at org.apache.nutch.storage.StorageUtils.createDataStore(StorageUtils.java:25)
        at org.apache.nutch.storage.StorageUtils.initMapperJob(StorageUtils.java:68)
        at org.apache.nutch.storage.StorageUtils.initMapperJob(StorageUtils.java:50)
        at org.apache.nutch.crawl.InjectorJob.inject(InjectorJob.java:237)
        at org.apache.nutch.tools.Benchmark.benchmark(Benchmark.java:190)
        at org.apache.nutch.tools.Benchmark.run(Benchmark.java:139)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
        at org.apache.nutch.tools.Benchmark.main(Benchmark.java:32)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
        at java.lang.reflect.Method.invoke(Method.java:597)
        at org.apache.hadoop.util.RunJar.main(RunJar.java:156)

Isn't this puzzling... It turns out that java.sql.DriverManager will try _all_ registered drivers in turn to see which one can handle the jdbcUrl; the usual magic of Class.forName(jdbcDriver) doesn't mean we are going to use jdbcDriver - it's just there to make sure the driver class was loaded and registered itself on the list of available drivers.

Now I know why this particular error occurred - Hadoop includes HSQLDB 1.8, and we use HSQLDB 2.0. When DriverManager tries each driver in turn, unfortunately HSQLDB is first on the classpath (it comes in Hadoop/lib) and MySQL is the last, so it bombs out even before trying the right driver... For now I changed my build.xml to this:

Index: build.xml
===================================================================
--- build.xml   (revision 983564)
+++ build.xml   (working copy)
@@ -123,7 +123,7 @@
          excludes="nutch-default.xml,nutch-site.xml"/>
       <zipfileset dir="${conf.dir}" excludes="*.template,hadoop*.*"/>
       <zipfileset dir="${build.lib.dir}" prefix="lib"
-        includes="**/*.jar" excludes="hadoop-*.jar"/>
+        includes="**/*.jar" excludes="hadoop-*.jar,hsqldb*.jar"/>
       <zipfileset dir="${build.plugins}" prefix="plugins"/>
     </jar>
   </target>

--
Best regards,
Andrzej Bialecki
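The DriverManager behaviour described above can be sketched with a toy driver (pure JDK, no real database; the ToyDriver class and its URL prefixes are mine). Registration order decides which drivers get probed first, and dispatch happens through each driver's connect()/acceptsURL(), not through the class passed to Class.forName():

```java
import java.sql.Connection;
import java.sql.Driver;
import java.sql.DriverManager;
import java.sql.DriverPropertyInfo;
import java.sql.SQLException;
import java.util.Properties;
import java.util.logging.Logger;

public class DriverDispatchDemo {

    // A toy driver that only handles URLs with its own prefix. Per the
    // JDBC contract, connect() must return null for foreign URLs; a
    // driver that blows up instead (here, HSQLDB 1.8 hit a
    // NoSuchMethodError against the 2.0 classes) aborts the probe
    // loop before later drivers get a chance.
    static class ToyDriver implements Driver {
        final String prefix;
        ToyDriver(String prefix) { this.prefix = prefix; }
        public boolean acceptsURL(String url) { return url.startsWith(prefix); }
        public Connection connect(String url, Properties info) throws SQLException {
            if (!acceptsURL(url)) return null;  // politely decline foreign URLs
            throw new SQLException("would connect via " + prefix); // no real backend
        }
        public DriverPropertyInfo[] getPropertyInfo(String u, Properties p) {
            return new DriverPropertyInfo[0];
        }
        public int getMajorVersion() { return 1; }
        public int getMinorVersion() { return 0; }
        public boolean jdbcCompliant() { return false; }
        public Logger getParentLogger() { return Logger.getGlobal(); }
    }

    // Report which registered driver ends up handling the URL.
    static String dispatch(String url) {
        try {
            DriverManager.registerDriver(new ToyDriver("jdbc:toy-a:"));
            DriverManager.registerDriver(new ToyDriver("jdbc:toy-b:"));
            DriverManager.getConnection(url);
            return "connected";
        } catch (SQLException e) {
            return e.getMessage();
        }
    }

    public static void main(String[] args) {
        // toy-a is registered first and is probed first, but it
        // declines; dispatch ultimately follows the URL prefix.
        System.out.println(dispatch("jdbc:toy-b:demo"));
    }
}
```

In the failure above, the first driver probed threw an Error rather than declining, which is why the right driver was never reached.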
Re: Tika HTML parsing
On 2010-08-15 06:54, Ken Krugler wrote:
> For what it's worth, I just committed some patches to Tika that should improve Tika's ability to extract HTML outlinks (in img and frame elements, at least). Support for iframe should be coming soon :) This is in 0.8-SNAPSHOT, and there's one troubling parse issue I'm tracking down, but I think Tika is getting closer to being usable by Nutch for typical web crawling.

Thanks Ken for pushing this work forward! A few questions:

* does this include image maps as well (area)?
* how does the code treat invalid HTML with both body and frameset?
* what's the status of extracting the meta robots and link rel information?

--
Best regards,
Andrzej Bialecki
Re: Alternative search box for Nutch site
On 2010-08-30 12:21, Otis Gospodnetic wrote:
> Hello peeps,
> We've created a patch for Tika and got some good and constructive feedback (see https://issues.apache.org/jira/browse/TIKA-488). Should we follow the same functionality pattern for nutch.apache.org as seen in TIKA-488?

Sure, why not - when preparing the patch let's follow the same rationales as those in TIKA-488, since they are applicable here too.

--
Best regards,
Andrzej Bialecki
Re: nutch 2.0 (trunk)
On 2010-09-07 14:50, Faruk Berksöz wrote:
> Dear all,
> When I try to fetch a web page (e.g. http://www.w3.org/Protocols/rfc2616/rfc2616-sec14.html) with the MySQL storage definition, I am seeing the following error in my Hadoop logs (no error with HBase):
>
> java.io.IOException: java.sql.BatchUpdateException: Data truncation: Data too long for column 'content' at row 1
>         at org.gora.sql.store.SqlStore.flush(SqlStore.java:316)
>         at org.gora.sql.store.SqlStore.close(SqlStore.java:163)
>         at org.gora.mapreduce.GoraOutputFormat$1.close(GoraOutputFormat.java:72)
>         at org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:567)
>         at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:408)
>         at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:216)
>
> The type of the column 'content' is BLOB. It may be important for the next developments of Gora. Should I file this in nutch-jira or github/gora or nothing?
>
> Environment: Ubuntu 10.04, JVM 1.6.0_20, Nutch 2.0 (trunk), MySQL / HBase (0.20.6) / Hadoop (0.20.2) pseudo-distributed

Yes, please create a JIRA issue. Thanks!

--
Best regards,
Andrzej Bialecki
Re: [VOTE] Apache Nutch 1.2 Release Candidate #1
On 2010-08-09 16:45, Julien Nioche wrote:
> I reopened https://issues.apache.org/jira/browse/NUTCH-870. It would be good to fix it before releasing 1.2.

This is fixed. How about doing the release now?

--
Best regards,
Andrzej Bialecki
Re: [VOTE] Apache Nutch 1.2 Release Candidate #4
On 2010-09-24 04:38, Mattmann, Chris A (388J) wrote:
> Hi Nutch PMC:
> /nudge Anyone get a chance to review this yet? I have some free cycles tomorrow and would really think it's cool if I could finally push out the 1.2 RC.

I had little time this week, but I'm testing it now... I should be done tomorrow.

--
Best regards,
Andrzej Bialecki
Re: [VOTE] Apache Nutch 1.2 Release Candidate #4
On 2010-09-24 20:40, Mattmann, Chris A (388J) wrote:
> Thanks Andrzej, appreciate it. I know you've been really vigilant with the other RCs I've thrown up about testing and I appreciate it. Other Nutch PMC'ers: just need one more VOTE. Help, please? :)

+1 - all unit tests pass, and a test crawl + indexing to Solr went just fine.

--
Best regards,
Andrzej Bialecki
Re: Build failed in Hudson: Nutch-trunk #1280
On 2010-10-19 06:01, Apache Hudson Server wrote:
> [Nutch-trunk] $ /bin/bash -xe /tmp/hudson7277994413075810777.sh
> + PATH=/home/hudson/tools/java/latest1.6/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/ucb:/usr/local/bin:/usr/bin:/usr/sfw/bin:/usr/sfw/sbin:/opt/sfw/bin:/opt/sfw/sbin:/opt/SUNWspro/bin:/usr/X/bin:/usr/ucb:/usr/sbin:/usr/ccs/bin
> + export ANT_HOME=/export/home/hudson/tools/ant/latest
> + ANT_HOME=/export/home/hudson/tools/ant/latest
> + export PATH ANT_HOME
> + cd trunk
> + /export/home/hudson/tools/ant/latest/bin/ant -Dversion=2010-10-19_04-00-41 -Dtest.junit.output.format=xml nightly
> /tmp/hudson7277994413075810777.sh: line 7: /export/home/hudson/tools/ant/latest/bin/ant: No such file or directory

Do you guys know why the automated builds are failing? It looks like Ant is not where the build script expects it to be...

--
Best regards,
Andrzej Bialecki
Re: ReviewBoard Instance
On 2010-10-26 15:53, Mattmann, Chris A (388J) wrote:
> Hi Guys,
> Gav from infra@ set up a ReviewBoard instance for Apache [1]. I've never used it before but I thought I'd request an account on it for Nutch [2] regardless, so if folks want to use it, they can.

Hmm, I may be missing something... but what's the point of using this tool in our JIRA-based workflow? It looks to me like it duplicates at least part of JIRA's functionality, and the remaining part is what we already do in JIRA by convention...

--
Best regards,
Andrzej Bialecki
Re: Java.io.IOException with multiple copyField/ directives
On 2010-12-03 09:52, Peter Litsegård wrote: Hi! I've run into some strange behaviour while using Nutch (solrindexer) together with Solr 1.4.1. I'd like to copy the 'title' and 'content' fields to another field, say, 'foo'. In my first attempt I added the copyField directives in schema.xml and got a Java exception, so I removed them from schema.xml. In my second attempt I added the copyField directives to the 'solrindex-mapping.xml' file and ran into the same exception again! Is this a known issue or have I stumbled into unknown territory? Any workarounds? I suspect that the target field type declared in your schema.xml is not multiValued. What was the exception?
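For reference, the usual fix on the Solr side looks something like this - a sketch of a schema.xml fragment (field and type names are illustrative, not taken from the thread). When two copyField directives share one destination, the destination field must be declared multiValued, otherwise adding the second value raises an exception at index time:

```xml
<!-- schema.xml sketch: 'foo' receives input from two source fields,
     so it must be multiValued -->
<field name="title"   type="text" indexed="true" stored="true"/>
<field name="content" type="text" indexed="true" stored="true"/>
<field name="foo"     type="text" indexed="true" stored="true"
       multiValued="true"/>

<copyField source="title"   dest="foo"/>
<copyField source="content" dest="foo"/>
```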
Re: Does Nutch 2.0 in good enough shape to test?
(switching to devs) On 12/17/10 10:18 AM, Alexis wrote: Hi, I've spent some time working on this as well. I've just put together a blog entry addressing the issues I ran into. See http://techvineyard.blogspot.com/2010/12/build-nutch-20.html In a nutshell, I changed three pieces in Gora and Nutch code: - flush the datastore regularly in the Hadoop RecordWriter (in GoraOutputFormat) Careful here. A DataStore flush may be very expensive, so it should be done only when we are finished with the output. If you see that data is lost without this flush then this should be reported as a Gora bug. - wait for Hadoop job completion in the Fetcher job I missed your previous email... I'll fix this shortly - thanks for spotting it.
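To make the flush concern concrete, here is a minimal, self-contained sketch of the pattern being recommended. The ToyDataStore and ToyRecordWriter classes below are simplified stand-ins, not the real Gora/Hadoop APIs: the writer buffers every put and triggers the expensive flush exactly once, in close(), rather than after every record.

```java
import java.util.ArrayList;
import java.util.List;

// Simplified stand-in for a Gora-style datastore: put() buffers, flush() persists.
class ToyDataStore {
    final List<String> buffer = new ArrayList<>();
    final List<String> persisted = new ArrayList<>();
    int flushCount = 0;

    void put(String key) { buffer.add(key); }

    void flush() {            // expensive in a real store, so call it sparingly
        persisted.addAll(buffer);
        buffer.clear();
        flushCount++;
    }
}

// Sketch of a RecordWriter that defers flushing to close(), mirroring the
// advice that the output should be flushed once at the end, not per record.
class ToyRecordWriter {
    private final ToyDataStore store;

    ToyRecordWriter(ToyDataStore store) { this.store = store; }

    void write(String key) { store.put(key); }   // no flush here

    void close() { store.flush(); }              // single flush when output is done
}

public class FlushOnClose {
    public static void main(String[] args) {
        ToyDataStore store = new ToyDataStore();
        ToyRecordWriter writer = new ToyRecordWriter(store);
        for (int i = 0; i < 3; i++) writer.write("page-" + i);
        writer.close();
        System.out.println(store.persisted.size() + " records, "
                + store.flushCount + " flush");   // prints "3 records, 1 flush"
    }
}
```

If records go missing under this discipline, that points at a bug in the store's own buffering, which is why the thread suggests reporting it against Gora rather than flushing per record.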
Gora/HBase dependencies and deploy artifacts
Hi all, Recently I've been deploying Nutch trunk to an already existing Hadoop cluster, and I immediately hit a snag. Nutch was configured to use gora-hbase. The nutch.job jar doesn't include gora-hbase even if it was configured in nutch-site.xml. Furthermore, gora-hbase depends on HBase and its dependencies, which need to be found on the classpath. Typically for development and testing I solved this issue by deploying gora-core and gora-hbase + all HBase libs to hadoop/lib across the cluster. This is a bit dirty - Hadoop clusters should be seen as a generic computing fabric, so they should be application-agnostic; besides, this creates maintenance issues for ops. We could put all these libs in lib/ inside nutch.job, so that they are unpacked and put on the classpath during task setup. This would work fine for Mapper/Reducer. HOWEVER... I saw in some versions of Hadoop that InputFormat / OutputFormat classes were initialized prior to this unpacking - and in our case these depend on the libs in the as-yet-unpacked job jar, e.g. GoraInputFormat. (I'm not 100% sure that's the case in Hadoop 0.20.2, so this is something that needs to be tested.) Furthermore, even if we packed the jars in lib/ inside nutch.job, many tools still wouldn't work, because they depend on classes from those libs during local execution (before the job is sent to task trackers), and the URLClassLoader can't load classes from jars within jars... A workaround would be to take all those jars and re-pack them together under the / directory in nutch.job. This would satisfy the dependencies for local execution and for Mapper/Reducer execution, but I'm not sure if it solves the problem of Input/OutputFormat-s that I mentioned above. In short, we need a clear, working procedure for how to deploy Gora backend implementations so that they work with Nutch and with a generic, unmodified Hadoop cluster.
Re: [jira] Closed: (NUTCH-951) Backport changes from 2.0 into 1.3
On 3/10/11 10:57 PM, Julien Nioche (JIRA) wrote: [ https://issues.apache.org/jira/browse/NUTCH-951?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Julien Nioche closed NUTCH-951. --- NUTCH-825 committed in revision 1080368 All the known improvements from 2.0 have been backported into 1.3 now The only remaining issue to address before rolling out a 1.3 release is NUTCH-914 Implement Apache Project Branding Requirements (and subtasks...)
Re: Differences 1.x and trunk
On 3/18/11 4:31 PM, Markus Jelsma wrote: Hi all, I'm trying to patch https://issues.apache.org/jira/browse/NUTCH-963 into trunk after committing it to 1.3. There are of course a lot of differences, so I need a little advice on how to proceed: - instead of using CrawlDB and CrawlDatum we now need WebTableReader? Actually you need to use StorageUtils to set up Mapper or Reducer contexts. See other tools, e.g. Fetcher or Generator. - trunk uses slf4j instead of commons-logging now? Yes. - a page is now represented by storage.WebPage? Yes. When you prepare a Job you also need to specify which fields from WebPage you are interested in (and only these fields will be pulled in from the storage). This is all handled by StorageUtils methods.
Re: [VOTE] Move 2.0 out of trunk
On 18/09/2011 02:21, Julien Nioche wrote: Hi, Following the discussions [1] on the dev-list about the future of Nutch 2.0, I would like to call for a vote on moving Nutch 2.0 from the trunk to a separate branch, promoting 1.4 to trunk, and considering 2.0 as unmaintained. The arguments for / against can be found in the thread I mentioned. The vote is open for the next 72 hours. [ ] +1 : Shelve 2.0 and move 1.4 to trunk [ ] 0 : No opinion [ ] -1 : Bad idea. Please give justification. +1 - at this point it's clear that 2.0 didn't pan out as we expected; we should restart from 1.x as the usable platform and continue the redesign from that codebase.
Re: [jira] [Commented] (NUTCH-797) parse-tika is not properly constructing URLs when the target begins with a ?
On 12/10/2011 13:17, Markus Jelsma (Commented) (JIRA) wrote: [ https://issues.apache.org/jira/browse/NUTCH-797?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13125717#comment-13125717 ] Markus Jelsma commented on NUTCH-797: - This test was on a local instance. I tried both values for parser.fix.embeddedparams with: $ bin/nutch parsechecker http://www.funkybabes.nl/;ROOOWAN/fotoboek Is this how it should be implemented? I'm not sure. Embedded params are a bit puzzling :) Hmm ... if that's the exact command line that you entered, and you are using a *nix shell, then the unquoted semicolon ends the command, so what was actually executed would be: $ bin/nutch parsechecker http://www.funkybabes.nl/ ...lots of output ... bash: ROOOWAN/fotoboek: command not found
Re: Nutch Maven artifacts now published as polled/nightly SNAPSHOTS
On 05/11/2011 06:44, Mattmann, Chris A (388J) wrote: Hey Guys, I modified the Jenkins jobs that Lewis set up to now: * poll SCM hourly for changes to Nutch * publish Maven snapshots (1.5-SNAPSHOT and above) of Nutch to repository.apache.org Very useful - thanks a lot!
Re: Persistent problems with Ivy dependencies in Eclipse
On 10/11/2011 04:39, Lewis John Mcgibbney wrote: It gets even stranger: both SWFParser and AutomationURLFilter import additional dependencies, yet these are not included in their plugin/ivy/ivy.xml files! Am I missing something here? Most likely these problems come from the initial port of a pure Ant build to an Ant+Ivy build. We should determine which deps are really needed by these plugins, and sanitize the ivy.xml files so that they make sense - if the existing files can't be untangled, we can ditch them and come up with new, clean ones.
Re: Signature == null ?
On 15/11/2011 20:33, Markus Jelsma wrote: It's back again! Last try, if someone has a pointer for this. Cheers After some DB updates, they're gone! Does anyone recognize this phenomenon? On Tuesday 08 November 2011 11:22:48 Markus Jelsma wrote: On Tuesday 08 November 2011 11:15:37 Markus Jelsma wrote: Hi guys, I've a M/R job selecting only DB_FETCHED and DB_NOTMODIFIED records and their signatures. I had to add a sanity check on signature to avoid an NPE. I had assumed that any record with such a DB_ status has to have a signature, right? Why does roughly 0.0001625% of my records exist without a signature? Now with correct metrics: Why does roughly 0.84% of my records exist without a signature? This could somehow be related to pages that come from redirects, so that when they are fetched they are accounted for under different URLs, which in turn may confuse the update code in CrawlDbReducer... Do you notice any pattern to these pages? What's their origin?
Re: Dependency Injection
On 22/11/2011 19:47, PJ Herring wrote: Hey Chris, Thanks for the response. I looked at the documents you sent me, and I really do think incorporating some kind of DI framework could be a great addition to Nutch. I have a general plan of attack, but I'll try to write that up more formally and send it out to get some feedback. This sounds interesting. As Chris mentioned, the current plugin system is far from ideal, but so far it has worked reasonably well. The key functionality it implements is: * self-discovery of services provided by each plugin, * easy pluggability, by virtue of dropping super-jars (jars with impl. classes and nested library jars) into a predefined location, * controlled classloader isolation between plugins, so that incompatible versions of libraries can be used, * but also the ability to export specified classes and libraries, so that one plugin can use another plugin's exported resources on its classpath, * optional auto-loading of dependent plugins. In the past one contributor made a bold attempt to port Nutch to OSGi, and it turned out to be much more complicated than we expected, with a bigger impact on the way Nutch applications were supposed to run ... so at that time we didn't think the complication was justified. If we can figure out something between full-blown OSGi and the current system, that would be great.
Re: Dependency Injection
On 23/11/2011 01:02, Andrzej Bialecki wrote: [earlier discussion of the plugin system's key functionality and the OSGi attempt quoted in full, trimmed here] You may also want to take a look at JSPF (http://code.google.com/p/jspf), which could perhaps be made to satisfy those requirements without too much refactoring.
Re: Upgrading to Hadoop 0.22.0+
On 13/12/2011 17:42, Lewis John Mcgibbney wrote: Hi Markus, I'm certainly in agreement here. If you'd like to open a JIRA issue, we can begin to build up a picture of what is required. Lewis On Tue, Dec 13, 2011 at 4:41 PM, Markus Jelsma markus.jel...@openindex.io wrote: Hi, To keep up with the rest of the world I believe we should move from the old Hadoop mapred API to the new MapReduce API, which has already been done for the nutchgora branch. Upgrading from hadoop-core to hadoop-common is easily done in Ivy, but all jobs must be tackled - and we have many jobs! Anyone willing to give pointers and a helping hand in this large task? I guess the question is also whether 0.22 is compatible enough to more or less compile with the existing code that uses the old API. If it does, then we can do the transition gradually; if it doesn't, then it's a bigger issue. This is easy to verify - just drop in the 0.22 jars and see if it compiles / tests pass.
Re: Upgrading to Hadoop 0.22.0+
On 13/12/2011 18:04, Markus Jelsma wrote: Hi, I did a quick test to see what happens, and it won't compile: it cannot find our old mapred APIs in 0.22. I've also tried 0.20.205.0, which compiles but won't run, and many tests fail with stuff like: Exception in thread "main" java.lang.NoClassDefFoundError: org/codehaus/jackson/map/JsonMappingException at org.apache.nutch.util.dupedb.HostDeduplicator.deduplicator(HostDeduplicator.java:421) Hmm... what's that? I don't see this class (or this package) in the Nutch tree. Also, trunk doesn't use JSON for anything as far as I know. at org.apache.nutch.util.dupedb.HostDeduplicator.run(HostDeduplicator.java:443) at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65) at org.apache.nutch.util.dupedb.HostDeduplicator.main(HostDeduplicator.java:431) Caused by: java.lang.ClassNotFoundException: org.codehaus.jackson.map.JsonMappingException at java.net.URLClassLoader$1.run(URLClassLoader.java:202) at java.security.AccessController.doPrivileged(Native Method) at java.net.URLClassLoader.findClass(URLClassLoader.java:190) at java.lang.ClassLoader.loadClass(ClassLoader.java:306) at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:301) at java.lang.ClassLoader.loadClass(ClassLoader.java:247) ... 4 more I think this can be overcome, but we cannot hide from the fact that all jobs must be ported to the new API at some point. You did some work on the new APIs - did you come across any cumbersome issues when working on it? It was quite some time ago ... but I don't remember anything being really complicated, it was just tedious - and once you've done one class, the other classes follow roughly the same pattern.
Re: [jira] [Created] (NUTCH-1225) Migrate CrawlDBScanner to MapReduce API
On 14/12/2011 16:01, Markus Jelsma wrote: This is highly annoying: MapFileOutputFormat is not present in the MapReduce API until 0.21! AFAIK that's not the case ... there are both an old-API and a new-API implementation (the old one is deprecated). The new-API version is in org.apache.hadoop.mapreduce.lib.output.
Re: [jira] [Created] (NUTCH-1225) Migrate CrawlDBScanner to MapReduce API
On 14/12/2011 18:30, Markus Jelsma wrote: proper link: http://hadoop.apache.org/common/docs/r0.20.205.0/api/org/apache/hadoop/mapreduce/lib/output/package-summary.html I thought the goal was to upgrade to 0.22, where this class is present. In 0.20.205, org.apache.hadoop.mapred.MapFileOutputFormat still uses the old API, and it's not deprecated yet.
Re: [jira] [Created] (NUTCH-1225) Migrate CrawlDBScanner to MapReduce API
On 15/12/2011 13:13, Markus Jelsma wrote: Hmm, I don't see how I can use the old mapred MapFileOutputFormat with the new Job API. job.setOutputFormatClass(MapFileOutputFormat.class) expects the mapreduce.lib.output.MapFileOutputFormat class and won't accept the old API: setOutputFormatClass(java.lang.Class<? extends org.apache.hadoop.mapreduce.OutputFormat>) in org.apache.hadoop.mapreduce.Job cannot be applied to (java.lang.Class<org.apache.hadoop.mapred.MapFileOutputFormat>) In short, I don't know how I can migrate jobs to the new API on 0.20.x without having MapFileOutputFormat present in the new API. Ah, no, that's not what I meant ... of course you need to change the code to use the new API, and the new code will look quite different :) My point was only that it is different in a consistent way, so after you've ported one or two classes the other ones are easy to convert, too... I'm bogged down with other work now, but I'll see if I can prepare an example later today...
Re: Build failed in Jenkins: Nutch-trunk #1706
On 28/12/2011 12:00, Lewis John Mcgibbney wrote: Hi Guys, Pretty strange compilation failure - this test class hasn't been touched in months, and on the surface, having looked at the test case, there appears to be no obvious reason for it failing to compile. I've kick-started another build on Jenkins to see if it will resolve itself. I don't think it will - I can reproduce this failure locally. Here's what fixed it for me (I'm pretty ignorant about Ivy/Maven, so there's likely a more correct fix for this): Index: ivy/ivy.xml === --- ivy/ivy.xml (revision 1225046) +++ ivy/ivy.xml (working copy) @@ -69,7 +69,7 @@ <!-- Configuration: test --> <!-- artifacts needed for testing --> - <dependency org="junit" name="junit" rev="3.8.1" conf="test->default" /> + <dependency org="junit" name="junit" rev="3.8.1" conf="*->default" /> <dependency org="org.apache.hadoop" name="hadoop-test" rev="0.20.205.0" conf="test->default" />
Re: Build failed in Jenkins: Nutch-trunk #1706
On 28/12/2011 14:15, Lewis John Mcgibbney wrote: Hi Andrzej, Can anyone confirm? I've tried this patch locally, and although I couldn't reproduce the original issue, it seems to be working fine for me as well. Check your lib/ dir - maybe you have a local copy of the junit jar that gets pulled onto the classpath and masks the issue? This happened to me once or twice...
Re: Drawing an analogy between AdaptiveFetchSchedule and AdaptiveCrawlDelay
On 02/03/2012 12:45, Lewis John Mcgibbney wrote: Hi Guys, As there were some comments on the user list, I recently got digging into HTTP redirects and stumbled across NUTCH-1042. Although redirects and crawl delays are individual issues, I think they are certainly linked; what is interesting is that users usually don't consider them interlinked, and therefore struggle to debug how and why either the redirected or the crawl-delayed pages are not being fetched. Doing some more digging I found the now rather old and tatty NUTCH-475, which got me thinking about how we maintain the AdaptiveFetchSchedule for custom refetching. Now I begin to think about the following: - Regardless of whether we implement an AdaptiveCrawlDelay, NUTCH-1042 still needs to be fixed, as this is obviously becoming a bit of a pain for some users. Yes. - Can someone shine some light on what happened to the Fetcher2.java that Dogacan refers to? I was only ever accustomed to OldFetcher and Fetcher :0) Fetcher2 is the current Fetcher. The original Fetcher was temporarily renamed OldFetcher and then removed.
Re: question about ObjectCache
On 10/04/2012 05:00, Xiaolong Yang wrote: Hi all, I'm reading the Nutch source code and I'm puzzled by ObjectCache.java in the org.apache.nutch.util package. It seems to be of little benefit in the urlnormalizers and urlfilters. I have also read some discussion about caching in NUTCH-169 and NUTCH-501, but I can't understand it. Can anyone tell me where ObjectCache is used to good effect in Nutch? ObjectCache is designed to cache ready-to-use instances of Nutch plugins. The process of finding, instantiating and initializing plugins is inefficient, because it involves parsing plugin descriptors, initializing plugins, collecting the ones that implement the correct extension points, etc. It would kill performance if this process were invoked each time you wanted to run all plugins of a given type (e.g. URLNormalizers). The facade classes URLNormalizers, URLFilters and others make sure that plugin instances of a given type are initialized once per lifetime of a JVM and then cached in ObjectCache, so that the next time you want to use them they can be retrieved from the cache instead of going through the parsing/instantiating/initializing process again.
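The initialize-once-then-cache idea described above can be sketched in a few lines. This is not the actual Nutch ObjectCache code, just a minimal, self-contained illustration of the pattern: the expensive initialization runs at most once per key, and every later lookup is a cheap map hit.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Supplier;

// Minimal sketch of the ObjectCache idea (not the real Nutch class):
// expensive plugin initialization runs once per key; later lookups just
// return the cached instance.
class ToyObjectCache {
    private final Map<String, Object> cache = new ConcurrentHashMap<>();
    int initializations = 0;   // exposed only so the example can show the count

    @SuppressWarnings("unchecked")
    <T> T getObject(String key, Supplier<T> expensiveInit) {
        return (T) cache.computeIfAbsent(key, k -> {
            initializations++;   // stands in for parsing descriptors, instantiating plugins, ...
            return expensiveInit.get();
        });
    }
}

public class CacheDemo {
    public static void main(String[] args) {
        ToyObjectCache cache = new ToyObjectCache();
        // Five requests for the same plugin set trigger only one initialization.
        for (int i = 0; i < 5; i++) {
            cache.getObject("urlnormalizers", () -> new String[] {"basic", "regex"});
        }
        System.out.println("initializations = " + cache.initializations);  // prints "initializations = 1"
    }
}
```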
[jira] Commented: (NUTCH-650) Hbase Integration
[ https://issues.apache.org/jira/browse/NUTCH-650?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12883559#action_12883559 ] Andrzej Bialecki commented on NUTCH-650: - As far as one can digest such a giant patch ;) I think this is OK; at least from the legal POV it clarifies the situation, and it doesn't bring in any dependencies with incompatible licenses. As for the content itself, we'll need to resolve this incrementally, as discussed on the list. So, a cautious +1 from me to apply this on branches/nutchbase. Hbase Integration - Key: NUTCH-650 URL: https://issues.apache.org/jira/browse/NUTCH-650 Project: Nutch Issue Type: New Feature Reporter: Doğacan Güney Assignee: Doğacan Güney Fix For: 2.0 Attachments: hbase-integration_v1.patch, hbase_v2.patch, latest-nutchbase-vs-original-branch-point.patch, latest-nutchbase-vs-svn-nutchbase.patch, malformedurl.patch, meta.patch, meta2.patch, nofollow-hbase.patch, NUTCH-650.patch, nutch-habase.patch, searching.diff, slash.patch This issue will track nutch/hbase integration -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Assigned: (NUTCH-837) Remove search servers and Lucene dependencies
[ https://issues.apache.org/jira/browse/NUTCH-837?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrzej Bialecki reassigned NUTCH-837: --- Assignee: Andrzej Bialecki Remove search servers and Lucene dependencies -- Key: NUTCH-837 URL: https://issues.apache.org/jira/browse/NUTCH-837 Project: Nutch Issue Type: Task Components: searcher, web gui Affects Versions: 1.1 Reporter: Julien Nioche Assignee: Andrzej Bialecki Fix For: 2.0 One of the main aspects of 2.0 is the delegation of the indexing and search to external resources like SOLR. We can simplify the code a lot by getting rid of the : * search servers * indexing and analysis with Lucene * search side functionalities : ontologies / clustering etc... In the short term only SOLR / SOLRCloud will be supported but the plan would be to add other systems as well. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (NUTCH-837) Remove search servers and Lucene dependencies
[ https://issues.apache.org/jira/browse/NUTCH-837?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrzej Bialecki updated NUTCH-837: Attachment: NUTCH-837.patch Updated patch against r959954 (after NUTCH-836).
[jira] Updated: (NUTCH-837) Remove search servers and Lucene dependencies
[ https://issues.apache.org/jira/browse/NUTCH-837?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrzej Bialecki updated NUTCH-837: Attachment: (was: NUTCH-837.patch)
[jira] Commented: (NUTCH-837) Remove search servers and Lucene dependencies
[ https://issues.apache.org/jira/browse/NUTCH-837?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12884729#action_12884729 ] Andrzej Bialecki commented on NUTCH-837: - bq. So, I think we should still have a Nutch webapp and in my mind it's a must-have for a 2.0 release... I agree. But for the moment it's better to delete the old webapp stuff that we know for sure doesn't work with the current Nutch, and it will be completely reimplemented anyway.
[jira] Resolved: (NUTCH-837) Remove search servers and Lucene dependencies
[ https://issues.apache.org/jira/browse/NUTCH-837?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrzej Bialecki resolved NUTCH-837. - Resolution: Fixed Committed in r960064. Thanks for review!
[jira] Commented: (NUTCH-821) Use ivy in nutch builds
[ https://issues.apache.org/jira/browse/NUTCH-821?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12885188#action_12885188 ] Andrzej Bialecki commented on NUTCH-821: - I think this patch refers to some parts that were already removed in NUTCH-837 ... Also, it would be nice to have a target that sets up an Eclipse project - after this patch is applied, lib/ is nearly empty and you need to run the build at least once to bring in the dependencies - this may be confusing. Use ivy in nutch builds --- Key: NUTCH-821 URL: https://issues.apache.org/jira/browse/NUTCH-821 Project: Nutch Issue Type: New Feature Components: build Affects Versions: 2.0 Reporter: Enis Soztutar Assignee: Enis Soztutar Fix For: 2.0 Attachments: NUTCH-821.patch, nutchbase-ivy_v1.patch Ivy is the de-facto dependency management tool used in conjunction with Ant. It would be nice if we switched to using Ivy in Nutch builds. Maven is also an alternative, but I think Nutch will benefit more from an Ant+Ivy architecture.
[jira] Updated: (NUTCH-696) Timeout for Parser
[ https://issues.apache.org/jira/browse/NUTCH-696?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrzej Bialecki updated NUTCH-696: Attachment: timeout.patch A simple patch that implements the strategy outlined here http://bit.ly/bdTYrS - I've been recently suffering from this issue, so this is better than nothing. Julien's strategy would work, too, but then the job takes much longer to execute. Timeout for Parser -- Key: NUTCH-696 URL: https://issues.apache.org/jira/browse/NUTCH-696 Project: Nutch Issue Type: Wish Components: fetcher Reporter: Julien Nioche Priority: Minor Attachments: timeout.patch I found that the parsing sometimes crashes due to a problem on a specific document, which is a bit of a shame as this blocks the rest of the segment and Hadoop ends up finding that the node does not respond. I was wondering about whether it would make sense to have a timeout mechanism for the parsing so that if a document is not parsed after a time t, it is simply treated as an exception and we can get on with the rest of the process. Does that make sense? Where do you think we should implement that, in ParseUtil? -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
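The attached timeout.patch isn't reproduced in this email. As a rough sketch (not the actual patch - the class and method names below are illustrative), the general strategy of running the parse in a separate thread and abandoning it after a deadline might look like this:

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class ParseWithTimeout {
    // Run a potentially hanging task with a hard time limit; a hypothetical
    // stand-in for wrapping a Parser.getParse() call inside ParseUtil.
    static <T> T callWithTimeout(Callable<T> task, long timeoutSec, T fallback) {
        ExecutorService exec = Executors.newSingleThreadExecutor();
        Future<T> future = exec.submit(task);
        try {
            return future.get(timeoutSec, TimeUnit.SECONDS);
        } catch (TimeoutException e) {
            future.cancel(true);  // interrupt the stuck parse thread
            return fallback;      // treat as a failed parse and move on
        } catch (Exception e) {
            return fallback;      // parse threw: also treat as a failure
        } finally {
            exec.shutdownNow();
        }
    }

    public static void main(String[] args) {
        // A fast task completes normally; a hanging one hits the deadline.
        String ok = callWithTimeout(() -> "parsed", 5, "failed");
        String hung = callWithTimeout(() -> {
            Thread.sleep(10_000);
            return "parsed";
        }, 1, "failed");
        System.out.println(ok + " " + hung);
    }
}
```

Note that future.cancel(true) only interrupts the worker thread; a parser stuck in non-interruptible code keeps the thread occupied, which is one reason fixing the underlying problem in the parser libraries remains the better long-term answer.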
[jira] Commented: (NUTCH-696) Timeout for Parser
[ https://issues.apache.org/jira/browse/NUTCH-696?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12885257#action_12885257 ] Andrzej Bialecki commented on NUTCH-696: - Yes - this patch is a quick solution that allowed me to complete a crawl. If people feel this is useful, let's polish it. Timeout for Parser -- Key: NUTCH-696 URL: https://issues.apache.org/jira/browse/NUTCH-696 -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Reopened: (NUTCH-696) Timeout for Parser
[ https://issues.apache.org/jira/browse/NUTCH-696?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrzej Bialecki reopened NUTCH-696: - This may be useful after all - let's gather more comments. Timeout for Parser -- Key: NUTCH-696 URL: https://issues.apache.org/jira/browse/NUTCH-696 -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (NUTCH-696) Timeout for Parser
[ https://issues.apache.org/jira/browse/NUTCH-696?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12885295#action_12885295 ] Andrzej Bialecki commented on NUTCH-696: - I agree, ultimately that's the way to go. However, I needed something _now_, and the patch helps to solve the problem that I have now - and until this problem is solved in Tika this patch provides some kind of band-aid for us poor Nutch-ers... Timeout for Parser -- Key: NUTCH-696 URL: https://issues.apache.org/jira/browse/NUTCH-696 -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (NUTCH-821) Use ivy in nutch builds
[ https://issues.apache.org/jira/browse/NUTCH-821?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12885583#action_12885583 ] Andrzej Bialecki commented on NUTCH-821: - +1 for this patch for now - all good comments, there's plenty of improvements we can make, so let's line them up as separate issues. Use ivy in nutch builds --- Key: NUTCH-821 URL: https://issues.apache.org/jira/browse/NUTCH-821 -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Created: (NUTCH-843) Separate the build and runtime environments
Separate the build and runtime environments
---
Key: NUTCH-843
URL: https://issues.apache.org/jira/browse/NUTCH-843
Project: Nutch
Issue Type: Improvement
Components: build
Affects Versions: 2.0
Reporter: Andrzej Bialecki
Assignee: Andrzej Bialecki

Currently there is no clean separation of source, build and runtime artifacts. On one hand, this makes it easier to get started in local mode, but on the other hand it makes the distributed (or pseudo-distributed) setup much more challenging and tricky. Also, some resources (config files and classes) are included several times on the classpath, they are loaded under different classloaders, and in the end it's not obvious which copy takes precedence, and why.

Here's an example of a harmful unintended behavior caused by this mess: Hadoop daemons (jobtracker and tasktracker) will get conf/ and build/ on their classpath. This means that a task running on this cluster will have two copies of resources from these locations - one from the classpath inherited from the tasktracker, and the other one from the just-unpacked nutch.job file. If these two versions differ, only the first one will be loaded, which in this case is the one taken from the (unpacked) conf/ and build/ - the other one, from within the nutch.job file, will be ignored. It's even worse when you add more nodes to the cluster - the nutch.job will be shipped to the new nodes as a part of each task setup, but now the remote tasktracker child processes will use resources from nutch.job - so some tasks will use different versions of resources than other tasks. This usually leads to a host of very difficult to debug issues.

This issue proposes to separate these environments into the following areas:

* source area - i.e. our current sources. Note that bin/ scripts will belong to this category too, so there will be no top-level bin/. nutch-default.xml belongs to this category too. Other customizable files can be moved to src/conf too, or they could stay in top-level conf/ as today, with a README that explains that changes made there take effect only after you rebuild the job jar.
* build area - contains build artifacts, among them the nutch.job jar.
* runtime (or deploy) area - contains all artifacts needed to run Nutch jobs.

For a distributed setup that uses an existing Hadoop cluster (installed from a plain vanilla Hadoop release) this will be a {{/deploy}} directory, where we put the following:
{code}
bin/nutch
nutch.job
{code}
That's it - nothing else should be needed, because all other resources are already included in the job jar. These resources can be copied directly to the master Hadoop node.

For a local setup (using LocalJobTracker) this will be a {{/runtime}} directory, where we put the following:
{code}
bin/nutch
lib/hadoop-libs
plugins/
nutch.job
{code}
Due to limitations in the PluginClassLoader the local runtime requires that the plugins/ directory be unpacked from the job jar. And we need the hadoop libs to run in local mode.

We may later on refine this local setup to something like this:
{code}
bin/nutch
conf/
lib/hadoop-libs
lib/nutch-libs
plugins/
nutch.jar
{code}
so that it's easier to modify the config without rebuilding the job jar (which actually would not be used in this case).

-- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
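The shadowing behavior described above can be observed directly: a classloader exposes every copy of a resource it can see, in consultation order, and single-resource lookups (Class.getResource and friends) only ever use the first one. A small illustrative sketch, not part of any patch:

```java
import java.io.IOException;
import java.net.URL;
import java.util.ArrayList;
import java.util.Enumeration;
import java.util.List;

public class ClasspathDup {
    // Collect every copy of a resource visible on the classpath, in the
    // order the classloader consults them. Since single-resource lookups
    // use only the FIRST hit, a copy of nutch-default.xml in an unpacked
    // conf/ or build/ directory silently shadows the one inside nutch.job.
    static List<URL> copiesOf(String resource) throws IOException {
        List<URL> copies = new ArrayList<>();
        Enumeration<URL> e =
                ClasspathDup.class.getClassLoader().getResources(resource);
        while (e.hasMoreElements()) {
            copies.add(e.nextElement());
        }
        return copies;
    }

    public static void main(String[] args) throws IOException {
        // On a misconfigured cluster node this would print two entries;
        // only the first one ever takes effect.
        for (URL u : copiesOf("nutch-default.xml")) {
            System.out.println(u);
        }
    }
}
```

Running this inside a tasktracker child on an affected cluster would show both the inherited conf/ copy and the nutch.job copy, making the precedence problem visible.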
[jira] Updated: (NUTCH-843) Separate the build and runtime environments
[ https://issues.apache.org/jira/browse/NUTCH-843?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrzej Bialecki updated NUTCH-843: Attachment: NUTCH-843.patch This patch moves bin/nutch to src/bin/nutch, and creates /runtime/deploy and /runtime/local areas, populated with the right pieces. bin/nutch has been modified to work correctly in both cases. Separate the build and runtime environments --- Key: NUTCH-843 URL: https://issues.apache.org/jira/browse/NUTCH-843 -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (NUTCH-843) Separate the build and runtime environments
[ https://issues.apache.org/jira/browse/NUTCH-843?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12886015#action_12886015 ] Andrzej Bialecki commented on NUTCH-843: - We need to create the job file anyway. Actually, the patch I attached does something like this for the local setup (lib/ is flattened), but still I would argue for setting up two areas, /runtime/deploy and /runtime/local - it's painfully obvious then what parts you need to deploy to a Hadoop cluster. Separate the build and runtime environments --- Key: NUTCH-843 URL: https://issues.apache.org/jira/browse/NUTCH-843 -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (NUTCH-843) Separate the build and runtime environments
[ https://issues.apache.org/jira/browse/NUTCH-843?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrzej Bialecki updated NUTCH-843: Attachment: NUTCH-843.patch Updated patch that moves nutch.jar to lib/ for the local runtime. Separate the build and runtime environments --- Key: NUTCH-843 URL: https://issues.apache.org/jira/browse/NUTCH-843 -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (NUTCH-844) Improve NutchConfiguration
[ https://issues.apache.org/jira/browse/NUTCH-844?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrzej Bialecki updated NUTCH-844: Attachment: conf.patch Improve NutchConfiguration -- Key: NUTCH-844 URL: https://issues.apache.org/jira/browse/NUTCH-844 Project: Nutch Issue Type: Improvement Affects Versions: 2.0 Reporter: Andrzej Bialecki Assignee: Andrzej Bialecki Fix For: 2.0 Attachments: conf.patch This patch cleans up NutchConfiguration from servlet dependency, and modifies the API to allow bootstrapping via API from Properties. This is important for use cases where Nutch is embedded in a larger application. Also, while I'm at it, remove the support for alternative crawl configuration when running Crawl tool, which has always been a source of confusion. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (NUTCH-843) Separate the build and runtime environments
[ https://issues.apache.org/jira/browse/NUTCH-843?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12886318#action_12886318 ] Andrzej Bialecki commented on NUTCH-843: - runtime/local doesn't need Hadoop scripts - by definition it uses the local FS and the local job tracker, so Hadoop scripts are of no use. Native libs: see NUTCH-845. Separate the build and runtime environments --- Key: NUTCH-843 URL: https://issues.apache.org/jira/browse/NUTCH-843 -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (NUTCH-843) Separate the build and runtime environments
[ https://issues.apache.org/jira/browse/NUTCH-843?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12886330#action_12886330 ] Andrzej Bialecki commented on NUTCH-843: - Pseudo-distributed (i.e. a real JobTracker with a single TaskTracker) suffers from the same classpath issues that I described above, so even in such a case it's best to run jobs in a separate environment, using /runtime/deploy artifacts. Separate the build and runtime environments --- Key: NUTCH-843 URL: https://issues.apache.org/jira/browse/NUTCH-843 -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Resolved: (NUTCH-845) Native hadoop libs not available through maven
[ https://issues.apache.org/jira/browse/NUTCH-845?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrzej Bialecki resolved NUTCH-845. - Fix Version/s: 2.0 Resolution: Fixed Committed in rev. 961778. Thanks for review! Native hadoop libs not available through maven -- Key: NUTCH-845 URL: https://issues.apache.org/jira/browse/NUTCH-845 Project: Nutch Issue Type: Bug Components: build Affects Versions: 2.0 Reporter: Andrzej Bialecki Assignee: Andrzej Bialecki Fix For: 2.0 There are no maven artifacts for the native libs (I verified this on Hadoop ML). I think it's better to delete the libs, after all we don't want to keep bits and pieces of dependencies in our svn, but let's leave a placeholder and a README that explains how to get them. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (NUTCH-844) Improve NutchConfiguration
[ https://issues.apache.org/jira/browse/NUTCH-844?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrzej Bialecki updated NUTCH-844: Attachment: NUTCH-844.patch Updated patch. This also addresses an issue in PluginRepository, which uses Configuration as a key in its internal cache - the problem is that Configuration doesn't implement hashCode, so the cache would have been ineffective in situations like this:
{code}
Configuration conf = NutchConfiguration.create();
PluginRepository repo1 = PluginRepository.get(conf);
JobConf job = new NutchJob(conf);
PluginRepository repo2 = PluginRepository.get(job);
// repo2 is a new instance, but should be the same instance!
{code}
The new code sets a UUID property, so the cache knows it's still the same instance. There's a new unit test that ensures this works properly when using NutchConfiguration.create(), and illustrates that it fails without the uuid. Improve NutchConfiguration -- Key: NUTCH-844 URL: https://issues.apache.org/jira/browse/NUTCH-844 -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
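The UUID trick can be sketched with a toy stand-in for Hadoop's Configuration (everything below is illustrative, not the actual patch): because the real Configuration does not override hashCode/equals, two objects holding identical properties are distinct map keys, so the cache is instead keyed on a UUID stored as an ordinary property, which survives copying.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Properties;
import java.util.UUID;

public class ConfCache {
    // Hypothetical stand-in for Hadoop's Configuration: like the real class,
    // it does not override hashCode()/equals(), so two objects holding the
    // same properties would be distinct keys in a HashMap.
    static class Conf {
        final Properties props = new Properties();
        Conf() {}
        Conf(Conf other) { props.putAll(other.props); } // like new NutchJob(conf)
    }

    static final String UUID_KEY = "nutch.conf.uuid";
    static final Map<String, Object> CACHE = new HashMap<>();

    // Stamp a fresh configuration with a unique id, as the patch does in
    // NutchConfiguration.create().
    static Conf create() {
        Conf conf = new Conf();
        conf.props.setProperty(UUID_KEY, UUID.randomUUID().toString());
        return conf;
    }

    // Key the cache on the stored UUID instead of the Conf object itself,
    // so a copied configuration maps back to the same cached instance.
    static Object get(Conf conf) {
        return CACHE.computeIfAbsent(conf.props.getProperty(UUID_KEY),
                k -> new Object());
    }

    public static void main(String[] args) {
        Conf conf = create();
        Conf job = new Conf(conf);  // copy carries the UUID along
        System.out.println(get(conf) == get(job));  // true: same cached instance
    }
}
```

Keying on the UUID string makes the cache behave as if Configuration had value-based identity, without touching Hadoop's own classes.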
[jira] Resolved: (NUTCH-844) Improve NutchConfiguration
[ https://issues.apache.org/jira/browse/NUTCH-844?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrzej Bialecki resolved NUTCH-844. - Resolution: Fixed Committed in r964063. Thanks for review! Improve NutchConfiguration -- Key: NUTCH-844 URL: https://issues.apache.org/jira/browse/NUTCH-844 Project: Nutch Issue Type: Improvement Affects Versions: 2.0 Reporter: Andrzej Bialecki Assignee: Andrzej Bialecki Fix For: 2.0 Attachments: conf.patch, NUTCH-844.patch This patch cleans up NutchConfiguration from servlet dependency, and modifies the API to allow bootstrapping via API from Properties. This is important for use cases where Nutch is embedded in a larger application. Also, while I'm at it, remove the support for alternative crawl configuration when running Crawl tool, which has always been a source of confusion. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (NUTCH-858) No longer able to set per-field boosts on lucene documents
[ https://issues.apache.org/jira/browse/NUTCH-858?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrzej Bialecki updated NUTCH-858: Assignee: Andrzej Bialecki Fix Version/s: 1.2 No longer able to set per-field boosts on lucene documents -- Key: NUTCH-858 URL: https://issues.apache.org/jira/browse/NUTCH-858 Project: Nutch Issue Type: Bug Components: indexer Affects Versions: 1.1 Environment: n/a Reporter: Edward Drapkin Assignee: Andrzej Bialecki Fix For: 1.2 I'm working on upgrading from Nutch 0.9 to Nutch 1.1 and I've noticed that it no longer seems possible to set boosts on specific fields in lucene documents. This is, in my opinion, a major feature regression and removes a huge component to fine tuning search. Can this be added? -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (NUTCH-858) No longer able to set per-field boosts on lucene documents
[ https://issues.apache.org/jira/browse/NUTCH-858?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12890873#action_12890873 ] Andrzej Bialecki commented on NUTCH-858: - Unfortunately no. The patch was included in a fix to NUTCH-837, which is relative to trunk, and it's not directly applicable to 1.x, needs to be ported. No longer able to set per-field boosts on lucene documents -- Key: NUTCH-858 URL: https://issues.apache.org/jira/browse/NUTCH-858 Project: Nutch Issue Type: Bug Components: indexer Affects Versions: 1.1 Environment: n/a Reporter: Edward Drapkin Assignee: Andrzej Bialecki Fix For: 1.2 I'm working on upgrading from Nutch 0.9 to Nutch 1.1 and I've noticed that it no longer seems possible to set boosts on specific fields in lucene documents. This is, in my opinion, a major feature regression and removes a huge component to fine tuning search. Can this be added? -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Resolved: (NUTCH-863) Benchmark and a testbed proxy server
[ https://issues.apache.org/jira/browse/NUTCH-863?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrzej Bialecki resolved NUTCH-863. - Fix Version/s: 2.0 Resolution: Fixed Committed in rev. 980932. Benchmark and a testbed proxy server Key: NUTCH-863 URL: https://issues.apache.org/jira/browse/NUTCH-863 Project: Nutch Issue Type: New Feature Affects Versions: 2.0 Reporter: Andrzej Bialecki Assignee: Andrzej Bialecki Fix For: 2.0 Attachments: proxy.patch This issue adds two components: * a testbed proxy server that can serve various content: pre-fetched Nutch segments, forward requests to original URLs, or create a lot of unique but predictable fake content (with outlinks) on the fly. * a simple Benchmark class to measure the time taken to complete several crawl cycles using fake content. * 'ant proxy' and 'ant benchmark' targets to execute a benchmark run. Together these tools should provide a more or less objective method to measure the end-to-end crawl performance. This initial version can be further instrumented to collect statistics about various stages of data processing. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Created: (NUTCH-867) Port Nutch benchmark to Nutchbase
Port Nutch benchmark to Nutchbase - Key: NUTCH-867 URL: https://issues.apache.org/jira/browse/NUTCH-867 Project: Nutch Issue Type: New Feature Affects Versions: nutchbase Reporter: Andrzej Bialecki Assignee: Andrzej Bialecki Fix For: nutchbase Bring tools from NUTCH-863 to Nutchbase, and measure the performance of the Nutchbase branch vs. trunk. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (NUTCH-858) No longer able to set per-field boosts on lucene documents
[ https://issues.apache.org/jira/browse/NUTCH-858?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12895377#action_12895377 ] Andrzej Bialecki commented on NUTCH-858: - It was r960064, but I have to admit I sneaked in this improvement as a part of NUTCH-837, which contained a lot of other stuff... No longer able to set per-field boosts on lucene documents -- Key: NUTCH-858 URL: https://issues.apache.org/jira/browse/NUTCH-858 Project: Nutch Issue Type: Bug Components: indexer Affects Versions: 1.1 Environment: n/a Reporter: Edward Drapkin Assignee: Andrzej Bialecki Fix For: 1.2 I'm working on upgrading from Nutch 0.9 to Nutch 1.1 and I've noticed that it no longer seems possible to set boosts on specific fields in lucene documents. This is, in my opinion, a major feature regression and removes a huge component to fine tuning search. Can this be added? -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (NUTCH-867) Port Nutch benchmark to Nutchbase
[ https://issues.apache.org/jira/browse/NUTCH-867?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrzej Bialecki updated NUTCH-867: Attachment: benchmark.patch Ported benchmark that uses HSQLDB as the store impl. If there are no objections I'll commit it shortly. Port Nutch benchmark to Nutchbase - Key: NUTCH-867 URL: https://issues.apache.org/jira/browse/NUTCH-867 Project: Nutch Issue Type: New Feature Affects Versions: nutchbase Reporter: Andrzej Bialecki Assignee: Andrzej Bialecki Fix For: nutchbase Attachments: benchmark.patch Bring tools from NUTCH-863 to Nutchbase, and measure the performance of the Nutchbase branch vs. trunk. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (NUTCH-876) Remove remaining robots/IP blocking code in lib-http
[ https://issues.apache.org/jira/browse/NUTCH-876?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrzej Bialecki updated NUTCH-876: Attachment: NUTCH-876.patch Patch to fix the issue. If there are no objections I'll commit this shortly. Remove remaining robots/IP blocking code in lib-http Key: NUTCH-876 URL: https://issues.apache.org/jira/browse/NUTCH-876 Project: Nutch Issue Type: Bug Components: fetcher Affects Versions: 2.0 Reporter: Andrzej Bialecki Assignee: Andrzej Bialecki Attachments: NUTCH-876.patch There are remains of the (very old) blocking code in lib-http/.../HttpBase.java. This code was used with the OldFetcher to manage politeness limits. New trunk doesn't have OldFetcher anymore, so this code is useless. Furthermore, there is an actual bug here - FetcherJob forgets to set Protocol.CHECK_BLOCKING and Protocol.CHECK_ROBOTS to false, and the defaults in lib-http are set to true. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Created: (NUTCH-879) URL-s getting lost
URL-s getting lost -- Key: NUTCH-879 URL: https://issues.apache.org/jira/browse/NUTCH-879 Project: Nutch Issue Type: Bug Affects Versions: 2.0 Environment: * Ubuntu 10.4 x64, Sun JDK 1.6 * using 1-node Hadoop + HDFS * trunk r983472, using MySQL store * branch-1.3 Reporter: Andrzej Bialecki I ran the Benchmark using branch-1.3 and trunk (formerly nutchbase). With the same Benchmark parameters and the same plugins branch-1.3 collects ~1.5mln urls, while trunk collects ~20,000 urls. Clearly something is wrong. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (NUTCH-880) REST API (and webapp) for Nutch
[ https://issues.apache.org/jira/browse/NUTCH-880?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrzej Bialecki updated NUTCH-880: Description: This issue is for discussing a REST-style API for accessing Nutch. Here's an initial idea: * I propose to use org.restlet for handling requests and returning JSON/XML/whatever responses. * hook up all regular tools so that they can be driven via this API. This would have to be an async API, since all Nutch operations take a long time to execute. It follows that we also need to be able to list running operations, retrieve their current status, and possibly abort/cancel/stop/suspend/resume/...? This also means that we would have to potentially create and manage many threads in a servlet - AFAIK this is frowned upon by J2EE purists... * package this in a webapp (that includes all deps, essentially nutch.job content), with the restlet servlet as an entry point. Open issues: * how to implement the reading of crawl results via this API * should we manage only crawls that use a single configuration per webapp, or should we have a notion of crawl contexts (sets of crawl configs) with CRUD ops on them? this would be nice, because it would allow managing several different crawls, with different configs, in a single webapp - but it complicates the implementation a lot. was: This issue is for discussing a REST-style API for accessing Nutch. Here's an initial idea: * I propose to use org.restlet for handling JSON requests * hook up all regular tools so that they can be driven via this API. This would have to be an async API, since all Nutch operations take a long time to execute. It follows that we also need to be able to list running operations, retrieve their current status, and possibly abort/cancel/stop/suspend/resume/...? This also means that we would have to potentially create and manage many threads in a servlet - AFAIK this is frowned upon by J2EE purists... 
* package this in a webapp (that includes all deps, essentially nutch.job content), with the restlet servlet as an entry point. Open issues: * how to implement the reading of crawl results via this API * should we manage only crawls that use a single configuration per webapp, or should we have a notion of crawl contexts (sets of crawl configs) with CRUD ops on them? this would be nice, because it would allow managing several different crawls, with different configs, in a single webapp - but it complicates the implementation a lot. REST API (and webapp) for Nutch --- Key: NUTCH-880 URL: https://issues.apache.org/jira/browse/NUTCH-880 Project: Nutch Issue Type: New Feature Affects Versions: 2.0 Reporter: Andrzej Bialecki This issue is for discussing a REST-style API for accessing Nutch. Here's an initial idea: * I propose to use org.restlet for handling requests and returning JSON/XML/whatever responses. * hook up all regular tools so that they can be driven via this API. This would have to be an async API, since all Nutch operations take a long time to execute. It follows that we also need to be able to list running operations, retrieve their current status, and possibly abort/cancel/stop/suspend/resume/...? This also means that we would have to potentially create and manage many threads in a servlet - AFAIK this is frowned upon by J2EE purists... * package this in a webapp (that includes all deps, essentially nutch.job content), with the restlet servlet as an entry point. Open issues: * how to implement the reading of crawl results via this API * should we manage only crawls that use a single configuration per webapp, or should we have a notion of crawl contexts (sets of crawl configs) with CRUD ops on them? this would be nice, because it would allow managing several different crawls, with different configs, in a single webapp - but it complicates the implementation a lot. -- This message is automatically generated by JIRA. 
- You can reply to this email to add a comment to the issue online.
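The async job management sketched in the description above could be backed by something like the following. JobManager and JobStatus are hypothetical names, not Nutch or restlet APIs: long-running tools are submitted to an executor and tracked by id, so a REST handler can list jobs, poll their status, and cancel them.

```java
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch of async job tracking for a REST frontend.
// A restlet resource would call submit()/status()/cancel() from its
// request handlers instead of running the tool inline.
public class JobManager {
    public enum JobStatus { RUNNING, DONE, CANCELLED }

    private final ExecutorService pool = Executors.newCachedThreadPool();
    private final Map<String, Future<?>> jobs = new ConcurrentHashMap<>();

    // Submit a long-running tool (e.g. inject, generate, fetch) under an id.
    public String submit(String id, Runnable tool) {
        jobs.put(id, pool.submit(tool));
        return id;
    }

    // Poll the current state of a job; null if the id is unknown.
    public JobStatus status(String id) {
        Future<?> f = jobs.get(id);
        if (f == null) return null;
        if (f.isCancelled()) return JobStatus.CANCELLED;
        return f.isDone() ? JobStatus.DONE : JobStatus.RUNNING;
    }

    // Best-effort abort, interrupting the worker thread.
    public boolean cancel(String id) {
        Future<?> f = jobs.get(id);
        return f != null && f.cancel(true);
    }

    // List all known operations, running or finished.
    public Set<String> list() { return jobs.keySet(); }

    public void shutdown() { pool.shutdownNow(); }

    public static void main(String[] args) throws InterruptedException {
        JobManager m = new JobManager();
        m.submit("inject-1", () -> { });
        while (m.status("inject-1") == JobStatus.RUNNING) Thread.sleep(10);
        System.out.println(m.status("inject-1")); // prints DONE
        m.shutdown();
    }
}
```

In a servlet container the executor would live in the webapp's application scope, which is precisely the "threads in a servlet" concern raised in the description.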
[jira] Created: (NUTCH-884) FetcherJob should run more reduce tasks than default
FetcherJob should run more reduce tasks than default Key: NUTCH-884 URL: https://issues.apache.org/jira/browse/NUTCH-884 Project: Nutch Issue Type: Improvement Components: fetcher Affects Versions: 2.0 Reporter: Andrzej Bialecki Assignee: Andrzej Bialecki Fix For: 2.0 FetcherJob now performs fetching in the reduce phase. This means that in a typical Hadoop setup there will be many fewer reduce tasks than map tasks, and consequently the max. total throughput of Fetcher will be proportionally reduced. I propose that FetcherJob should set the number of reduce tasks to the number of map tasks. This way the fetching will be more granular. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
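The proposed sizing rule can be sketched as below. In Hadoop terms the fix would amount to something like `job.setNumReduceTasks(job.getNumMapTasks())`; the helper class here is hypothetical and only captures the rule for illustration.

```java
// Illustrative only, not the actual FetcherJob code. With fetching done
// in the reduce phase, throughput is roughly proportional to the number
// of reduce tasks, so matching the map-task count restores parallelism
// lost to a small cluster-wide reducer default.
public class FetcherParallelism {
    // One fetching reducer per map task, never fewer than one.
    static int reduceTasks(int numMapTasks) {
        return Math.max(1, numMapTasks);
    }

    public static void main(String[] args) {
        // e.g. a fetch list split into 100 map tasks on a cluster whose
        // default is a handful of reducers
        System.out.println(reduceTasks(100)); // prints 100
    }
}
```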
[jira] Resolved: (NUTCH-872) Change the default fetcher.parse to FALSE
[ https://issues.apache.org/jira/browse/NUTCH-872?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrzej Bialecki resolved NUTCH-872. - Fix Version/s: 2.0 Resolution: Fixed I changed the name of the option to -parse to be consistent with the nutch-default.xml naming. I also updated the API to use this name, it's less confusing this way. Committed in rev. 984401. Thanks for the feedback. Change the default fetcher.parse to FALSE - Key: NUTCH-872 URL: https://issues.apache.org/jira/browse/NUTCH-872 Project: Nutch Issue Type: Improvement Affects Versions: 1.2, 2.0 Reporter: Andrzej Bialecki Assignee: Andrzej Bialecki Fix For: 2.0 I propose to change this property to false. The reason is that it's a safer default - parsing issues don't lead to a loss of the downloaded content. For larger crawls this is the recommended way to run Fetcher. Users that run smaller crawls can still override it. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (NUTCH-884) FetcherJob should run more reduce tasks than default
[ https://issues.apache.org/jira/browse/NUTCH-884?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrzej Bialecki updated NUTCH-884: Attachment: NUTCH-884.patch Patch with the change. I also rearranged the arguments to FetcherJob.fetch(..) to make more sense (IMHO). FetcherJob should run more reduce tasks than default Key: NUTCH-884 URL: https://issues.apache.org/jira/browse/NUTCH-884 Project: Nutch Issue Type: Improvement Components: fetcher Affects Versions: 2.0 Reporter: Andrzej Bialecki Assignee: Andrzej Bialecki Fix For: 2.0 Attachments: NUTCH-884.patch FetcherJob now performs fetching in the reduce phase. This means that in a typical Hadoop setup there will be many fewer reduce tasks than map tasks, and consequently the max. total throughput of Fetcher will be proportionally reduced. I propose that FetcherJob should set the number of reduce tasks to the number of map tasks. This way the fetching will be more granular. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (NUTCH-882) Design a Host table in GORA
[ https://issues.apache.org/jira/browse/NUTCH-882?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12899810#action_12899810 ] Andrzej Bialecki commented on NUTCH-882: - This functionality is very useful for larger crawls. Some comments about the design: * the table can be populated by injection, as in the patch, or from webtable. Since keys are from different spaces (url-s vs. hosts) I think it would be very tricky to try to do this on the fly in one of the existing jobs... so this means an additional step in the workflow. * I'm worried about the scalability of the approach taken by HostMDApplierJob - per-host data will be multiplied by the number of urls from a host and put into webtable, which will in turn balloon the size of webtable... A little background: what we see here is a design issue typical for mapreduce, where you have to merge data keyed by keys from different spaces (with different granularity). Possible solutions involve: * first converting the data to a common key space and then submitting both datasets as mapreduce inputs, or * submitting only the finer-grained input to mapreduce and dynamically converting the keys on the fly (and reading data directly from the coarser-grained source, accessing it randomly). A similar situation is described in HADOOP-3063 together with a solution, namely, to use random access and use Bloom filters to quickly discover missing keys. So I propose that instead of statically merging the data (HostMDApplierJob) we could merge it dynamically on the fly, by implementing a high-performance reader of the host table, and then use this reader directly in the context of map()/reduce() tasks as needed. This reader should use a Bloom filter to quickly determine nonexistent keys, and it may use a limited amount of in-memory cache for existing records. The Bloom filter data should be re-computed on updates and stored/retrieved, to avoid lengthy initialization. 
The cost of using this approach is IMHO much smaller than the cost of statically joining this data. The static join costs both space and time to execute an additional job. Let's consider the dynamic join cost, e.g. in Fetcher - HostDBReader would be used only when initializing host queues, so the number of IO-s would be at most the number of unique hosts on the fetchlist (at most, because some of the host data may be missing - here's the Bloom filter to the rescue, to quickly discover this without doing any IO). During updatedb we would likely want to access this data in DbUpdateReducer. Keys are URLs here, and they are ordered in ascending order - but they are in host-reversed format, which means that URLs from similar hosts and domains are close together. This is beneficial, because when we read data from HostDBReader we will read records that are close together, thus avoiding seeks. We can also cache the retrieved per-host data in DbUpdateReducer. Design a Host table in GORA --- Key: NUTCH-882 URL: https://issues.apache.org/jira/browse/NUTCH-882 Project: Nutch Issue Type: New Feature Affects Versions: 2.0 Reporter: Julien Nioche Assignee: Julien Nioche Fix For: 2.0 Attachments: NUTCH-882-v1.patch Having a separate GORA table for storing information about hosts (and domains?) would be very useful for : * customising the behaviour of the fetching on a host basis e.g. number of threads, min time between threads etc... * storing stats * keeping metadata and possibly propagate them to the webpages * keeping a copy of the robots.txt and possibly use that later to filter the webtable * store sitemaps files and update the webtable accordingly I'll try to come up with a GORA schema for such a host table but any comments are of course already welcome -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
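The Bloom-filter-gated reader proposed in the comments above could look roughly like this sketch. All names here (HostDBReader, getHostMeta) are hypothetical, a HashMap stands in for the random-access host table, and the two hash probes are deliberately naive; a real implementation would use independent hash functions, persist the filter, and bound the cache.

```java
import java.util.BitSet;
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of the dynamic join: consult an in-memory Bloom
// filter before doing any random-access IO, so lookups for hosts with
// no stored metadata cost nothing.
public class HostDBReader {
    private static final int SIZE = 1 << 16;

    private final BitSet bloom = new BitSet(SIZE);
    private final Map<String, String> store = new HashMap<>(); // stands in for the host table
    private final Map<String, String> cache = new HashMap<>(); // limited in-memory cache

    // Two cheap, non-independent probes - illustration only.
    private int h1(String k) { return (k.hashCode() & 0x7fffffff) % SIZE; }
    private int h2(String k) { return ((k.hashCode() * 31) & 0x7fffffff) % SIZE; }

    public void put(String host, String meta) {
        store.put(host, meta);
        bloom.set(h1(host));
        bloom.set(h2(host));
    }

    public String getHostMeta(String host) {
        // Definitely-absent keys are rejected without touching the store.
        if (!bloom.get(h1(host)) || !bloom.get(h2(host))) return null;
        return cache.computeIfAbsent(host, store::get);
    }

    public static void main(String[] args) {
        HostDBReader reader = new HostDBReader();
        // Keys in host-reversed format, as in webtable.
        reader.put("org.apache.www", "threads=5");
        System.out.println(reader.getHostMeta("org.apache.www")); // prints threads=5
        System.out.println(reader.getHostMeta("com.example.www")); // prints null
    }
}
```

A false positive in the filter only costs one wasted store lookup; a negative answer costs no IO at all, which is the property the comment relies on for hosts with missing data.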
[jira] Created: (NUTCH-891) Nutch build should not depend on unversioned local deps
Nutch build should not depend on unversioned local deps --- Key: NUTCH-891 URL: https://issues.apache.org/jira/browse/NUTCH-891 Project: Nutch Issue Type: Bug Reporter: Andrzej Bialecki The fix in NUTCH-873 introduces an unknown variable to the build process. Since local ivy artifacts are unversioned, different people that install Gora jars at different points in time will use the same artifact id but in fact the artifacts (jars) will differ because they will come from different revisions of Gora sources. Therefore Nutch builds based on the same svn rev. won't be repeatable across different environments. As much as it pains the ivy purists ;) until Gora publishes versioned artifacts I'd like to revert the fix in NUTCH-873 and add again Gora jars built from a known external rev. We can add a README that contains commit id from Gora. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (NUTCH-891) Nutch build should not depend on unversioned local deps
[ https://issues.apache.org/jira/browse/NUTCH-891?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12900455#action_12900455 ] Andrzej Bialecki commented on NUTCH-891: - Yes, this would help. Nutch build should not depend on unversioned local deps --- Key: NUTCH-891 URL: https://issues.apache.org/jira/browse/NUTCH-891 Project: Nutch Issue Type: Bug Reporter: Andrzej Bialecki The fix in NUTCH-873 introduces an unknown variable to the build process. Since local ivy artifacts are unversioned, different people that install Gora jars at different points in time will use the same artifact id but in fact the artifacts (jars) will differ because they will come from different revisions of Gora sources. Therefore Nutch builds based on the same svn rev. won't be repeatable across different environments. As much as it pains the ivy purists ;) until Gora publishes versioned artifacts I'd like to revert the fix in NUTCH-873 and add again Gora jars built from a known external rev. We can add a README that contains commit id from Gora. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Created: (NUTCH-893) DataStore.put() silently loses records when executed from multiple processes
DataStore.put() silently loses records when executed from multiple processes Key: NUTCH-893 URL: https://issues.apache.org/jira/browse/NUTCH-893 Project: Nutch Issue Type: Bug Environment: Gora HEAD, SqlStore, MySQL 5.1, Ubuntu 10.4 x64, Sun JDK 1.6 Reporter: Andrzej Bialecki In order to debug the issue described in NUTCH-879 I created a test to simulate multiple clients appending to webtable (please see the patch), which is the situation that we have in distributed map-reduce jobs. There are two tests there: one that uses multiple threads within the same JVM, and another that uses a single thread in multiple JVMs. Each test first clears webtable (be careful!), then puts a bunch of pages, and finally checks that all are present and that their values correspond to their keys. To make things more interesting, each execution context (thread or process) closes and reopens its instance of DataStore a few times. The multithreaded test passes just fine. However, the multi-process test fails with missing keys, as many as 30%. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (NUTCH-893) DataStore.put() silently loses records when executed from multiple processes
[ https://issues.apache.org/jira/browse/NUTCH-893?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrzej Bialecki updated NUTCH-893: Attachment: NUTCH-893.patch Unit test to illustrate the issue. DataStore.put() silently loses records when executed from multiple processes Key: NUTCH-893 URL: https://issues.apache.org/jira/browse/NUTCH-893 Project: Nutch Issue Type: Bug Affects Versions: 2.0 Environment: Gora HEAD, SqlStore, MySQL 5.1, Ubuntu 10.4 x64, Sun JDK 1.6 Reporter: Andrzej Bialecki Attachments: NUTCH-893.patch In order to debug the issue described in NUTCH-879 I created a test to simulate multiple clients appending to webtable (please see the patch), which is the situation that we have in distributed map-reduce jobs. There are two tests there: one that uses multiple threads within the same JVM, and another that uses single thread in multiple JVMs. Each test first clears webtable (be careful!), and then puts a bunch of pages, and finally counts that all are present and their values correspond to keys. To make things more interesting each execution context (thread or process) closes and reopens its instance of DataStore a few times. The multithreaded test passes just fine. However, the multi-process test fails with missing keys, as many as 30%. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (NUTCH-893) DataStore.put() silently loses records when executed from multiple processes
[ https://issues.apache.org/jira/browse/NUTCH-893?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12904226#action_12904226 ] Andrzej Bialecki commented on NUTCH-893: - Dogacan, flush() doesn't help - there are still missing keys. What's interesting is that the missing keys form sequential ranges. Could this be perhaps an issue with connection management, or some synchronization issue? DataStore.put() silently loses records when executed from multiple processes Key: NUTCH-893 URL: https://issues.apache.org/jira/browse/NUTCH-893 Project: Nutch Issue Type: Bug Affects Versions: 2.0 Environment: Gora HEAD, SqlStore, MySQL 5.1, Ubuntu 10.4 x64, Sun JDK 1.6 Reporter: Andrzej Bialecki Priority: Blocker Fix For: 2.0 Attachments: NUTCH-893.patch In order to debug the issue described in NUTCH-879 I created a test to simulate multiple clients appending to webtable (please see the patch), which is the situation that we have in distributed map-reduce jobs. There are two tests there: one that uses multiple threads within the same JVM, and another that uses single thread in multiple JVMs. Each test first clears webtable (be careful!), and then puts a bunch of pages, and finally counts that all are present and their values correspond to keys. To make things more interesting each execution context (thread or process) closes and reopens its instance of DataStore a few times. The multithreaded test passes just fine. However, the multi-process test fails with missing keys, as many as 30%. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (NUTCH-893) DataStore.put() silently loses records when executed from multiple processes
[ https://issues.apache.org/jira/browse/NUTCH-893?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12907297#action_12907297 ] Andrzej Bialecki commented on NUTCH-893: - Very good catch - yes, the test now passes for me too. This is actually good news for Gora :) I'll continue digging regarding NUTCH-879 ... don't hesitate if you have any ideas how to solve that. I suspect we may be losing keys in Generator or Fetcher, due to partitioning collisions but this hypothesis needs to be tested. DataStore.put() silently loses records when executed from multiple processes Key: NUTCH-893 URL: https://issues.apache.org/jira/browse/NUTCH-893 Project: Nutch Issue Type: Bug Affects Versions: 2.0 Environment: Gora HEAD, SqlStore, MySQL 5.1, Ubuntu 10.4 x64, Sun JDK 1.6 Reporter: Andrzej Bialecki Priority: Blocker Fix For: 2.0 Attachments: NUTCH-893.patch, NUTCH-893_v2.patch In order to debug the issue described in NUTCH-879 I created a test to simulate multiple clients appending to webtable (please see the patch), which is the situation that we have in distributed map-reduce jobs. There are two tests there: one that uses multiple threads within the same JVM, and another that uses single thread in multiple JVMs. Each test first clears webtable (be careful!), and then puts a bunch of pages, and finally counts that all are present and their values correspond to keys. To make things more interesting each execution context (thread or process) closes and reopens its instance of DataStore a few times. The multithreaded test passes just fine. However, the multi-process test fails with missing keys, as many as 30%. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (NUTCH-893) DataStore.put() silently loses records when executed from multiple processes
[ https://issues.apache.org/jira/browse/NUTCH-893?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12908791#action_12908791 ] Andrzej Bialecki commented on NUTCH-893: - +1 and +1. DataStore.put() silently loses records when executed from multiple processes Key: NUTCH-893 URL: https://issues.apache.org/jira/browse/NUTCH-893 Project: Nutch Issue Type: Bug Affects Versions: 2.0 Environment: Gora HEAD, SqlStore, MySQL 5.1, Ubuntu 10.4 x64, Sun JDK 1.6 Reporter: Andrzej Bialecki Priority: Blocker Fix For: 2.0 Attachments: NUTCH-893.patch, NUTCH-893_v2.patch In order to debug the issue described in NUTCH-879 I created a test to simulate multiple clients appending to webtable (please see the patch), which is the situation that we have in distributed map-reduce jobs. There are two tests there: one that uses multiple threads within the same JVM, and another that uses single thread in multiple JVMs. Each test first clears webtable (be careful!), and then puts a bunch of pages, and finally counts that all are present and their values correspond to keys. To make things more interesting each execution context (thread or process) closes and reopens its instance of DataStore a few times. The multithreaded test passes just fine. However, the multi-process test fails with missing keys, as many as 30%. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Created: (NUTCH-907) DataStore API doesn't support multiple storage areas for multiple disjoint crawls
DataStore API doesn't support multiple storage areas for multiple disjoint crawls - Key: NUTCH-907 URL: https://issues.apache.org/jira/browse/NUTCH-907 Project: Nutch Issue Type: Bug Reporter: Andrzej Bialecki Fix For: 2.0 In Nutch 1.x it was possible to easily select a set of crawl data (crawldb, page data, linkdb, etc) by specifying a path where the data was stored. This enabled users to run several disjoint crawls with different configs, but still using the same storage medium, just under different paths. This is not possible now because there is a 1:1 mapping between a specific DataStore instance and a set of crawl data. In order to support this functionality the Gora API should be extended so that it can create stores (and data tables in the underlying storage) that use arbitrary prefixes to identify the particular crawl dataset. Then the Nutch API should be extended to allow passing this crawlId value to select one of possibly many existing crawl datasets. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
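The crawlId prefixing proposed above could be as simple as the sketch below; the helper name schemaName and the underscore convention are illustrative assumptions, not the actual Gora API. The same storage backend would then host several disjoint crawl datasets side by side.

```java
// Hypothetical sketch: compose a per-crawl table name from a crawlId
// prefix, so disjoint crawls can share one storage medium.
public class CrawlSchemas {
    static String schemaName(String crawlId, String baseTable) {
        // No crawlId selects the default, unprefixed dataset.
        return (crawlId == null || crawlId.isEmpty())
                ? baseTable
                : crawlId + "_" + baseTable;
    }

    public static void main(String[] args) {
        System.out.println(schemaName("crawl1", "webpage")); // prints crawl1_webpage
        System.out.println(schemaName(null, "webpage"));     // prints webpage
    }
}
```

Tools would then accept a crawlId argument and resolve their DataStore through a mapping like this, mirroring how Nutch 1.x selected crawl data by path.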
[jira] Commented: (NUTCH-882) Design a Host table in GORA
[ https://issues.apache.org/jira/browse/NUTCH-882?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12909757#action_12909757 ] Andrzej Bialecki commented on NUTCH-882: - +1 to NutchContext. See also NUTCH-907 because the changes required in Gora API will likely make this task easier (once implemented ;) ). Design a Host table in GORA --- Key: NUTCH-882 URL: https://issues.apache.org/jira/browse/NUTCH-882 Project: Nutch Issue Type: New Feature Affects Versions: 2.0 Reporter: Julien Nioche Assignee: Julien Nioche Fix For: 2.0 Attachments: NUTCH-882-v1.patch Having a separate GORA table for storing information about hosts (and domains?) would be very useful for : * customising the behaviour of the fetching on a host basis e.g. number of threads, min time between threads etc... * storing stats * keeping metadata and possibly propagate them to the webpages * keeping a copy of the robots.txt and possibly use that later to filter the webtable * store sitemaps files and update the webtable accordingly I'll try to come up with a GORA schema for such a host table but any comments are of course already welcome -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (NUTCH-907) DataStore API doesn't support multiple storage areas for multiple disjoint crawls
[ https://issues.apache.org/jira/browse/NUTCH-907?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12910109#action_12910109 ] Andrzej Bialecki commented on NUTCH-907: - That's very good news - in that case I'm fine with the Gora API as it is now, we should change Nutch to make use of this functionality. DataStore API doesn't support multiple storage areas for multiple disjoint crawls - Key: NUTCH-907 URL: https://issues.apache.org/jira/browse/NUTCH-907 Project: Nutch Issue Type: Bug Reporter: Andrzej Bialecki Fix For: 2.0 In Nutch 1.x it was possible to easily select a set of crawl data (crawldb, page data, linkdb, etc) by specifying a path where the data was stored. This enabled users to run several disjoint crawls with different configs, but still using the same storage medium, just under different paths. This is not possible now because there is a 1:1 mapping between a specific DataStore instance and a set of crawl data. In order to support this functionality the Gora API should be extended so that it can create stores (and data tables in the underlying storage) that use arbitrary prefixes to identify the particular crawl dataset. Then the Nutch API should be extended to allow passing this crawlId value to select one of possibly many existing crawl datasets. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (NUTCH-880) REST API (and webapp) for Nutch
[ https://issues.apache.org/jira/browse/NUTCH-880?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrzej Bialecki updated NUTCH-880: Attachment: API.patch Initial patch for discussion. This is a work in progress, so only some functionality is implemented, and even less than that is actually working ;) I would appreciate a review and comments. REST API (and webapp) for Nutch --- Key: NUTCH-880 URL: https://issues.apache.org/jira/browse/NUTCH-880 Project: Nutch Issue Type: New Feature Affects Versions: 2.0 Reporter: Andrzej Bialecki Assignee: Andrzej Bialecki Attachments: API.patch This issue is for discussing a REST-style API for accessing Nutch. Here's an initial idea: * I propose to use org.restlet for handling requests and returning JSON/XML/whatever responses. * hook up all regular tools so that they can be driven via this API. This would have to be an async API, since all Nutch operations take a long time to execute. It follows then that we also need to be able to list running operations, retrieve their current status, and possibly abort/cancel/stop/suspend/resume/...? This also means that we would potentially have to create and manage many threads in a servlet - AFAIK this is frowned upon by J2EE purists... * package this in a webapp (that includes all deps, essentially the nutch.job content), with the restlet servlet as an entry point. Open issues: * how to implement the reading of crawl results via this API * should we manage only crawls that use a single configuration per webapp, or should we have a notion of crawl contexts (sets of crawl configs) with CRUD ops on them? This would be nice, because it would allow managing several different crawls, with different configs, in a single webapp - but it complicates the implementation a lot. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
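The async requirement described above - submit a long-running tool, then later list jobs, poll their status, and cancel them - maps naturally onto an ExecutorService plus a job registry. A minimal sketch under that assumption (none of these names are the actual NUTCH-880 API; a real REST layer such as the proposed restlet servlet would sit in front of it):

```java
import java.util.Map;
import java.util.Set;
import java.util.UUID;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Registry of long-running crawl operations, suitable for driving from a
// REST layer: submit returns an id; status/cancel/list work on that id.
public class JobManager {
  public enum State { RUNNING, DONE, FAILED, CANCELLED }

  private final ExecutorService pool = Executors.newCachedThreadPool();
  private final Map<String, Future<?>> jobs = new ConcurrentHashMap<>();

  public String submit(Runnable tool) {
    String id = UUID.randomUUID().toString();
    jobs.put(id, pool.submit(tool));
    return id;
  }

  public State status(String id) {
    Future<?> f = jobs.get(id);
    if (f == null) throw new IllegalArgumentException("unknown job " + id);
    if (f.isCancelled()) return State.CANCELLED;
    if (!f.isDone()) return State.RUNNING;
    try { f.get(); return State.DONE; }
    catch (Exception e) { return State.FAILED; }
  }

  public boolean cancel(String id) {
    Future<?> f = jobs.get(id);
    return f != null && f.cancel(true);  // interrupt the running tool
  }

  public Set<String> list() { return jobs.keySet(); }

  public void shutdown() { pool.shutdownNow(); }
}
```

This sidesteps the "many threads in a servlet" concern only partially; in a real container one would hand the pool's lifecycle to the webapp's context listener.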
[jira] Assigned: (NUTCH-862) HttpClient null pointer exception
[ https://issues.apache.org/jira/browse/NUTCH-862?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrzej Bialecki reassigned NUTCH-862: --- Assignee: Andrzej Bialecki HttpClient null pointer exception - Key: NUTCH-862 URL: https://issues.apache.org/jira/browse/NUTCH-862 Project: Nutch Issue Type: Bug Components: fetcher Affects Versions: 1.0.0 Environment: linux, java 6 Reporter: Sebastian Nagel Assignee: Andrzej Bialecki Priority: Minor Attachments: NUTCH-862.patch When re-fetching a document (a continued crawl) HttpClient throws a null pointer exception, causing the document to be emptied:
2010-07-27 12:45:09,199 INFO fetcher.Fetcher - fetching http://localhost/doc/selfhtml/html/index.htm
2010-07-27 12:45:09,203 ERROR httpclient.Http - java.lang.NullPointerException
2010-07-27 12:45:09,204 ERROR httpclient.Http - at org.apache.nutch.protocol.httpclient.HttpResponse.init(HttpResponse.java:138)
2010-07-27 12:45:09,204 ERROR httpclient.Http - at org.apache.nutch.protocol.httpclient.Http.getResponse(Http.java:154)
2010-07-27 12:45:09,204 ERROR httpclient.Http - at org.apache.nutch.protocol.http.api.HttpBase.getProtocolOutput(HttpBase.java:220)
2010-07-27 12:45:09,204 ERROR httpclient.Http - at org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:537)
2010-07-27 12:45:09,204 INFO fetcher.Fetcher - fetch of http://localhost/doc/selfhtml/html/index.htm failed with: java.lang.NullPointerException
Because the document is re-fetched, the server answers 304 (Not Modified):
127.0.0.1 - - [27/Jul/2010:12:45:09 +0200] GET /doc/selfhtml/html/index.htm HTTP/1.0 304 174 - Nutch-1.0
No content is sent in this case (empty HTTP body).
Index: trunk/src/plugin/protocol-httpclient/src/java/org/apache/nutch/protocol/httpclient/HttpResponse.java
===================================================================
--- trunk/src/plugin/protocol-httpclient/src/java/org/apache/nutch/protocol/httpclient/HttpResponse.java (revision 979647)
+++ trunk/src/plugin/protocol-httpclient/src/java/org/apache/nutch/protocol/httpclient/HttpResponse.java (working copy)
@@ -134,7 +134,8 @@
       if (code == 200) throw new IOException(e.toString());
       // for codes other than 200 OK, we are fine with empty content
     } finally {
-      in.close();
+      if (in != null)
+        in.close();
       get.abort();
     }
-- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Resolved: (NUTCH-906) Nutch OpenSearch sometimes raises DOMExceptions due to Lucene column names not being valid XML tag names
[ https://issues.apache.org/jira/browse/NUTCH-906?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrzej Bialecki resolved NUTCH-906. - Fix Version/s: 1.2 Resolution: Fixed Fixed in rev. 998261. Thanks! Nutch OpenSearch sometimes raises DOMExceptions due to Lucene column names not being valid XML tag names Key: NUTCH-906 URL: https://issues.apache.org/jira/browse/NUTCH-906 Project: Nutch Issue Type: Bug Components: web gui Affects Versions: 1.1 Environment: Debian GNU/Linux 64-bit Reporter: Asheesh Laroia Assignee: Andrzej Bialecki Fix For: 1.2 Attachments: 0001-OpenSearch-If-a-Lucene-column-name-begins-with-a-num.patch Original Estimate: 0.33h Remaining Estimate: 0.33h The Nutch FAQ explains that OpenSearch includes all fields that are available at search result time. However, some Lucene column names can start with numbers. Valid XML tags cannot. If Nutch is generating OpenSearch results for a document with a Lucene document column whose name starts with numbers, the underlying Xerces library throws this exception: org.w3c.dom.DOMException: INVALID_CHARACTER_ERR: An invalid or illegal XML character is specified. So I have written a patch that tests strings before they are used to generate tags within OpenSearch. I hope you merge this, or a better version of the patch! -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
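The failure mode is easy to reproduce with the stock JAXP DOM API: element names that start with a digit violate XML 1.0 naming rules, so createElement throws exactly the DOMException quoted above. The sketch below shows both a reproduction and the kind of guard the patch presumably applies; the class name, the helper method, and the example field name "20100901-date" are all hypothetical, not taken from the actual patch:

```java
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.DOMException;
import org.w3c.dom.Document;

public class XmlNameCheck {

  // Conservative approximation of the XML 1.0 Name production: the first
  // character must be a letter or '_' (we skip ':' since it is reserved
  // for namespaces), the rest letters, digits, '_', '-' or '.'.
  public static boolean isValidXmlTagName(String name) {
    if (name == null || name.isEmpty()) return false;
    char first = name.charAt(0);
    if (!(Character.isLetter(first) || first == '_')) return false;
    for (int i = 1; i < name.length(); i++) {
      char c = name.charAt(i);
      if (!(Character.isLetterOrDigit(c) || c == '_' || c == '-' || c == '.'))
        return false;
    }
    return true;
  }

  public static void main(String[] args) throws ParserConfigurationException {
    Document doc = DocumentBuilderFactory.newInstance()
        .newDocumentBuilder().newDocument();
    try {
      // A field name starting with digits is not a legal XML tag name...
      doc.createElement("20100901-date");
      System.out.println("unexpectedly accepted");
    } catch (DOMException e) {
      // ...so the DOM implementation throws INVALID_CHARACTER_ERR,
      // as reported in this issue.
      System.out.println("rejected: " + e.getMessage());
    }
  }
}
```

A guard like isValidXmlTagName, applied before createElement, is enough to skip (or rename) offending fields instead of aborting the whole OpenSearch response.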
[jira] Commented: (NUTCH-909) Add alternative search-provider to Nutch site
[ https://issues.apache.org/jira/browse/NUTCH-909?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12912474#action_12912474 ] Andrzej Bialecki commented on NUTCH-909: - bq. It might be better to see the message Search with Apache Solr (as on the Tika site). Yes, let's make this uniform. Add alternative search-provider to Nutch site - Key: NUTCH-909 URL: https://issues.apache.org/jira/browse/NUTCH-909 Project: Nutch Issue Type: Improvement Components: documentation Reporter: Alex Baranau Priority: Minor Attachments: NUTCH-909.patch Add an additional search provider (alongside the existing Lucid Find): search-lucene.com. Initiated in discussion: http://search-lucene.com/m/2suCr1UnDfF1 According to Andrzej's suggestion, when preparing the patch let's follow the same rationales as those in TIKA-488, since they are applicable here too, so please refer to that issue for more insight on implementation details. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (NUTCH-880) REST API (and webapp) for Nutch
[ https://issues.apache.org/jira/browse/NUTCH-880?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12913118#action_12913118 ] Andrzej Bialecki commented on NUTCH-880: - bq. I think we can combine the approach you outlined in NUTCH-907 with this one. I'm not sure... they are really not the same thing - you can execute many crawls with different seed lists, but still using the same Configuration. bq. What is CLASS ? It's the same as bin/nutch fully.qualified.class.name, only here I require that it implements NutchTool. bq. Btw, Andrzej, I will be happy to help out with the implementation if you want. By all means - I haven't had time so far to progress beyond this patch... REST API (and webapp) for Nutch --- Key: NUTCH-880 URL: https://issues.apache.org/jira/browse/NUTCH-880 Project: Nutch Issue Type: New Feature Affects Versions: 2.0 Reporter: Andrzej Bialecki Assignee: Andrzej Bialecki Attachments: API.patch This issue is for discussing a REST-style API for accessing Nutch. Here's an initial idea: * I propose to use org.restlet for handling requests and returning JSON/XML/whatever responses. * hook up all regular tools so that they can be driven via this API. This would have to be an async API, since all Nutch operations take a long time to execute. It follows then that we also need to be able to list running operations, retrieve their current status, and possibly abort/cancel/stop/suspend/resume/...? This also means that we would potentially have to create and manage many threads in a servlet - AFAIK this is frowned upon by J2EE purists... * package this in a webapp (that includes all deps, essentially the nutch.job content), with the restlet servlet as an entry point. Open issues: * how to implement the reading of crawl results via this API * should we manage only crawls that use a single configuration per webapp, or should we have a notion of crawl contexts (sets of crawl configs) with CRUD ops on them? This would be nice, because it would allow managing several different crawls, with different configs, in a single webapp - but it complicates the implementation a lot. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (NUTCH-907) DataStore API doesn't support multiple storage areas for multiple disjoint crawls
[ https://issues.apache.org/jira/browse/NUTCH-907?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12916870#action_12916870 ] Andrzej Bialecki commented on NUTCH-907: - Hi Sertan, Thanks for the patch, this looks very good! A few comments: * I'm not good at naming things either... schemaId is a little bit cryptic though. If we didn't already use crawlId I would vote for that (and then rename crawlId to batchId or fetchId); as it is now... I don't know, maybe datasetId... * since we now create multiple datasets, we somehow need to manage them - i.e. at least list and delete them (create is implicit). There is no such functionality in this patch, but this can also be addressed as a separate issue. * IndexerMapReduce.createIndexJob: I think it would be useful to pass the datasetId as a Job property - this way indexing filter plugins can use this property to populate NutchDocument fields if needed. FWIW, it may be a good idea to do this in other jobs as well... DataStore API doesn't support multiple storage areas for multiple disjoint crawls - Key: NUTCH-907 URL: https://issues.apache.org/jira/browse/NUTCH-907 Project: Nutch Issue Type: Bug Reporter: Andrzej Bialecki Fix For: 2.0 Attachments: NUTCH-907.patch In Nutch 1.x it was possible to easily select a set of crawl data (crawldb, page data, linkdb, etc) by specifying a path where the data was stored. This enabled users to run several disjoint crawls with different configs, but still using the same storage medium, just under different paths. This is not possible now because there is a 1:1 mapping between a specific DataStore instance and a set of crawl data. In order to support this functionality the Gora API should be extended so that it can create stores (and data tables in the underlying storage) that use arbitrary prefixes to identify the particular crawl dataset. 
Then the Nutch API should be extended to allow passing this crawlId value to select one of possibly many existing crawl datasets. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (NUTCH-882) Design a Host table in GORA
[ https://issues.apache.org/jira/browse/NUTCH-882?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12916874#action_12916874 ] Andrzej Bialecki commented on NUTCH-882: - Doğacan, I missed your previous comment... the issue with partial bloom filters is usually solved by having each task store its own filter - this worked well for MapFile-s because they consisted of multiple parts, so a Reader would open a part and its corresponding bloom filter. Here it's more complicated, I agree... though this reminds me of the situation that is handled by DynamicBloomFilter: it's basically a set of Bloom filters with a facade that hides this fact from the user. Here we could construct something similar, i.e. don't merge partial filters after closing the output, but instead, when opening a Reader, read all partial filters and pretend they are one. Design a Host table in GORA --- Key: NUTCH-882 URL: https://issues.apache.org/jira/browse/NUTCH-882 Project: Nutch Issue Type: New Feature Affects Versions: 2.0 Reporter: Julien Nioche Assignee: Julien Nioche Fix For: 2.0 Attachments: hostdb.patch, NUTCH-882-v1.patch Having a separate GORA table for storing information about hosts (and domains?) would be very useful for: * customising the behaviour of the fetching on a per-host basis, e.g. number of threads, min time between threads etc... * storing stats * keeping metadata and possibly propagating them to the webpages * keeping a copy of the robots.txt and possibly using that later to filter the webtable * storing sitemap files and updating the webtable accordingly I'll try to come up with a GORA schema for such a host table but any comments are of course already welcome -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
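The facade idea in the comment above - leave the per-task partial filters unmerged and have the reader present them as one - can be sketched with toy BitSet-based filters. This is an illustrative stand-in with hypothetical names, not Hadoop's DynamicBloomFilter API:

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.List;

// Toy Bloom filter: three hash positions over a 2^16-bit array.
class PartialBloomFilter {
  private final BitSet bits = new BitSet(1 << 16);

  private int[] hashes(String key) {
    int h1 = key.hashCode();
    int h2 = (h1 >>> 16) | (h1 << 16);  // cheap second hash
    return new int[] { h1 & 0xFFFF, h2 & 0xFFFF, (h1 ^ h2) & 0xFFFF };
  }

  void add(String key) {
    for (int h : hashes(key)) bits.set(h);
  }

  boolean mightContain(String key) {
    for (int h : hashes(key)) if (!bits.get(h)) return false;
    return true;
  }
}

// Facade: a reader that treats many per-task partial filters as one
// filter, instead of merging them when the output is closed.
public class CompositeBloomReader {
  private final List<PartialBloomFilter> parts = new ArrayList<>();

  public void addPart(PartialBloomFilter part) { parts.add(part); }

  // A key "might be present" if any partial filter claims it; false
  // positives accumulate across parts, but no false negatives appear.
  public boolean mightContain(String key) {
    for (PartialBloomFilter p : parts)
      if (p.mightContain(key)) return true;
    return false;
  }
}
```

The trade-off is the usual one: the composite's false-positive rate is roughly the sum over parts, so with many tasks you either size each partial filter accordingly or merge them offline later.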
[jira] Commented: (NUTCH-864) Fetcher generates entries with status 0
[ https://issues.apache.org/jira/browse/NUTCH-864?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12916912#action_12916912 ] Andrzej Bialecki commented on NUTCH-864: - I think the difficulty comes from the simplification in 2.x as compared to 1.x, in that we keep a single status per page. In 1.x a side-effect of having two locations with two statuses (one db status in crawldb and one fetch status in segments) was that we had more information in updatedb to act upon. Now we should probably keep up to two statuses - one that reflects a temporary fetch status, as determined by the fetcher, and a final (reconciled) status as determined by updatedb, based on the knowledge of not only the plain fetch status and the old status but also possible redirects. If I'm not mistaken, currently the status is immediately overwritten by the fetcher, even before we get to updatedb, hence the problem. Fetcher generates entries with status 0 --- Key: NUTCH-864 URL: https://issues.apache.org/jira/browse/NUTCH-864 Project: Nutch Issue Type: Bug Components: fetcher Environment: Gora with SQLBackend URL: https://svn.apache.org/repos/asf/nutch/branches/nutchbase Last Changed Rev: 980748 Last Changed Date: 2010-07-30 14:19:52 +0200 (Fri, 30 Jul 2010) Reporter: Julien Nioche Assignee: Doğacan Güney Fix For: 2.0 After a round of fetching which got the following protocol status:
10/07/30 15:11:39 INFO mapred.JobClient: ACCESS_DENIED=2
10/07/30 15:11:39 INFO mapred.JobClient: SUCCESS=1177
10/07/30 15:11:39 INFO mapred.JobClient: GONE=3
10/07/30 15:11:39 INFO mapred.JobClient: TEMP_MOVED=138
10/07/30 15:11:39 INFO mapred.JobClient: EXCEPTION=93
10/07/30 15:11:39 INFO mapred.JobClient: MOVED=521
10/07/30 15:11:39 INFO mapred.JobClient: NOTFOUND=62
I ran: ./nutch org.apache.nutch.crawl.WebTableReader -stats
10/07/30 15:12:37 INFO crawl.WebTableReader: Statistics for WebTable:
10/07/30 15:12:37 INFO crawl.WebTableReader: TOTAL urls: 2690
10/07/30 15:12:37 INFO crawl.WebTableReader: retry 0: 2690
10/07/30 15:12:37 INFO crawl.WebTableReader: min score: 0.0
10/07/30 15:12:37 INFO crawl.WebTableReader: avg score: 0.7587361
10/07/30 15:12:37 INFO crawl.WebTableReader: max score: 1.0
10/07/30 15:12:37 INFO crawl.WebTableReader: status 0 (null): 649
10/07/30 15:12:37 INFO crawl.WebTableReader: status 2 (status_fetched): 1177 (SUCCESS=1177)
10/07/30 15:12:37 INFO crawl.WebTableReader: status 3 (status_gone): 112
10/07/30 15:12:37 INFO crawl.WebTableReader: status 34 (status_retry): 93 (EXCEPTION=93)
10/07/30 15:12:37 INFO crawl.WebTableReader: status 4 (status_redir_temp): 138 (TEMP_MOVED=138)
10/07/30 15:12:37 INFO crawl.WebTableReader: status 5 (status_redir_perm): 521 (MOVED=521)
10/07/30 15:12:37 INFO crawl.WebTableReader: WebTable statistics: done
There should not be any entries with status 0 (null). I will investigate a bit more... -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (NUTCH-913) Nutch should use new namespace for Gora
[ https://issues.apache.org/jira/browse/NUTCH-913?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12920610#action_12920610 ] Andrzej Bialecki commented on NUTCH-913: - There are formatting issues in DomainStatistics.java - the file uses literal tabs, which we frown upon, but the patch introduces double-space indent in the changed lines. As ugly as it sounds I think this should be changed into tabs, and then reformatted in another commit. Other than that, +1, go for it. Nutch should use new namespace for Gora --- Key: NUTCH-913 URL: https://issues.apache.org/jira/browse/NUTCH-913 Project: Nutch Issue Type: Bug Components: storage Reporter: Doğacan Güney Assignee: Doğacan Güney Fix For: 2.0 Attachments: NUTCH-913_v1.patch Gora is in Apache Incubator now (Yey!). We recently changed Gora's namespace from org.gora to org.apache.gora. This means nutch should use the new namespace otherwise it won't compile with newer builds of Gora. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (NUTCH-921) Reduce dependency of Nutch on config files
[ https://issues.apache.org/jira/browse/NUTCH-921?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrzej Bialecki updated NUTCH-921: Attachment: NUTCH-921.patch Patch that implements reading config parameters from Configuration, and falls back to config files if Configuration properties are unspecified. Reduce dependency of Nutch on config files -- Key: NUTCH-921 URL: https://issues.apache.org/jira/browse/NUTCH-921 Project: Nutch Issue Type: Improvement Affects Versions: 2.0 Reporter: Andrzej Bialecki Assignee: Andrzej Bialecki Fix For: 2.0 Attachments: NUTCH-921.patch Currently many components in Nutch rely on reading their configuration from files. These files need to be on the classpath (or packed into a job jar). This is inconvenient if you want to manage configuration via API, e.g. when embedding Nutch, or running many jobs with slightly different configurations. This issue tracks the improvement to make various components read their config directly from Configuration properties. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
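The fallback pattern described in this patch - prefer a value set programmatically on the configuration object, and read the classpath resource file only when the property is absent - can be sketched with plain java.util.Properties standing in for Hadoop's Configuration. All names below (ConfigFallback, the property key, the resource name) are illustrative, not the ones used by the actual NUTCH-921 patch:

```java
import java.io.IOException;
import java.io.InputStream;
import java.util.Properties;

public class ConfigFallback {
  private final Properties conf;      // stands in for Hadoop Configuration
  private final String resourceName;  // e.g. a rules file on the classpath

  public ConfigFallback(Properties conf, String resourceName) {
    this.conf = conf;
    this.resourceName = resourceName;
  }

  // Prefer the value set directly on the configuration; only when it is
  // unspecified do we fall back to reading the classpath resource, which
  // is what ties the component to config files in the first place.
  public String getRules(String propertyKey) throws IOException {
    String inline = conf.getProperty(propertyKey);
    if (inline != null) return inline;
    try (InputStream in = getClass().getClassLoader()
             .getResourceAsStream(resourceName)) {
      if (in == null)
        throw new IOException("no property " + propertyKey
            + " and no resource " + resourceName + " on classpath");
      return new String(in.readAllBytes());
    }
  }
}
```

With this inversion, embedding applications can drive everything through the configuration object, while existing deployments that ship config files keep working unchanged.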