Re: Nutch 2.0

2010-06-28 Thread Andrzej Bialecki
JIRA issue, prepare a patch like this, mark the checkbox, and list all dependencies and their licenses for those that are not already in Nutch svn? -- Best regards, Andrzej Bialecki ___. ___ ___ ___ _ _ __ [__ || __|__/|__||\/| Information Retrieval, Semantic

Re: Where is nutch 2.0

2010-06-29 Thread Andrzej Bialecki
On 2010-06-29 11:17, Raghavendra Neelekani wrote:
> Hi, can you please tell me where I can download Nutch 2.0?
Nutch 2.0 is in the planning and early development phase, so it can't be downloaded yet. We hope to produce a working Nutch 2.0 some time in Q4 2010. -- Best regards, Andrzej

[Nutchbase] WebPage class is a generated code?

2010-07-02 Thread Andrzej Bialecki
? If so, we should put this autogeneration step in our build.xml. Or am I missing something? -- Best regards, Andrzej Bialecki
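A rough sketch of what such a build.xml step could look like. The target name, the `classpath` reference, and the `schema.compiler.class`, `schema.file` and `src.dir` properties are placeholders invented for illustration; the snippet does not say which generator produces WebPage or how it is invoked.

```xml
<!-- Hypothetical target: every name below is a placeholder, not taken from the Nutch build -->
<target name="generate-webpage" description="Regenerate WebPage from its schema">
  <!-- Run the code generator in a forked JVM and fail the build if it fails -->
  <java classname="${schema.compiler.class}" fork="true" failonerror="true">
    <classpath refid="classpath"/>
    <!-- First argument: the schema file; second: the source tree to write into -->
    <arg value="${schema.file}"/>
    <arg value="${src.dir}"/>
  </java>
</target>
```

Making the compile target depend on a target like this would keep the generated class in sync with its schema, if that is indeed how the class is produced.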

Re: Minimizing the number of stored fields for Solr

2010-07-03 Thread Andrzej Bialecki
to this proposal - of course we should review our schema, and of course we should have a mechanism to get data from the storage layer, but what you propose is IMHO a premature optimization at this point. -- Best regards, Andrzej Bialecki

YCSB benchmark for KV stores

2010-07-03 Thread Andrzej Bialecki
Hi, Found this link: http://wiki.github.com/brianfrankcooper/YCSB/papers-and-presentations Would be cool to run the benchmark for the same stores but via Gora. -- Best regards, Andrzej Bialecki

Re: Parse-tika ignores too much data...

2010-07-08 Thread Andrzej Bialecki
and retaining only the frameset. -- Best regards, Andrzej Bialecki

Re: Merging in nutchbase

2010-07-10 Thread Andrzej Bialecki
to be done this way. -- Best regards, Andrzej Bialecki

Re: Merging in nutchbase

2010-07-10 Thread Andrzej Bialecki
> and tomorrow), nutch will run on gora - (embedded hsqldb) with zero configuration.
Excellent, that would be a real breakthrough. -- Best regards, Andrzej Bialecki

Re: Component fetching during parsing. (vertical crawling)

2010-07-20 Thread Andrzej Bialecki
use case properly, this is really a custom Fetcher that you are talking about - a strategy to fetch complete pages (together with its resources that relate to the display of the page) should be possible to implement in a custom fetcher without changing other Nutch areas. -- Best regards, Andrzej

Re: Nutchbase merge strategy

2010-07-21 Thread Andrzej Bialecki
/IssueNavigator.jspa?reset=true&pid=10680&updated%3Aprevious=-1w&created%3Aafter=1%2FApr%2F09&status=1&status=3&status=4&sorter/field=updated&sorter/order=DESC Actually, just two issues are still unresolved... hmm, not bad. -- Best regards, Andrzej Bialecki

[Nutchbase] Multi-value ParseResult missing

2010-07-21 Thread Andrzej Bialecki
the SubCrawler. What do you think? -- Best regards, Andrzej Bialecki

Benchmark of Nutch trunk

2010-07-30 Thread Andrzej Bialecki
... At the moment I can't figure out where this lock-up is happening, but the symptoms are obvious when you look at the logs in real-time. More stuff to come on this subject - at least we have a tool to experiment with :) -- Best regards, Andrzej Bialecki

Re: Seeking Insight into Nutch Configurations

2010-08-02 Thread Andrzej Bialecki
speed, and (rarely) the total bandwidth at your end. -- Best regards, Andrzej Bialecki

Re: Seeking Insight into Nutch Configurations

2010-08-02 Thread Andrzej Bialecki
all pages from a big site in one go, at least your crawls will finish quickly and you will quickly progress breadth-wise, if not depth-wise. -- Best regards, Andrzej Bialecki

[Nutchbase] jmxtools issue...

2010-08-04 Thread Andrzej Bialecki
Hi, I can't compile nutchbase at the moment - ivy has trouble finding jmxri.jar and jmxtools.jar ... I found jmxri.jar somewhere and put it in my .ivy2/local, but I can't find jmxtools.jar ... Anyway, why do we need these two jars at all??? -- Best regards, Andrzej Bialecki

Hsqldb 2.0 conflicts with Hsqldb 1.8 in Hadoop

2010-08-10 Thread Andrzej Bialecki
-*.jar"/>
+ includes="**/*.jar" excludes="hadoop-*.jar,hsqldb*.jar"/>
<zipfileset dir="${build.plugins}" prefix="plugins"/>
</jar>
</target>
-- Best regards, Andrzej Bialecki

Re: Tika HTML parsing

2010-08-15 Thread Andrzej Bialecki
? * what's the status of extracting the meta robots and link rel information? -- Best regards, Andrzej Bialecki

Re: Alternative search box for Nutch site

2010-08-30 Thread Andrzej Bialecki
not - when preparing the patch let's follow the same rationales as those in TIKA-488, since they are applicable here too. -- Best regards, Andrzej Bialecki

Re: nutch 2.0 (trunk)

2010-09-07 Thread Andrzej Bialecki
> . Should I file this in nutch-jira or github/gora or nothing?
> environments: Ubuntu 10.04, JVM 1.6.0_20, Nutch 2.0 (trunk), MySQL/HBase (0.20.6), Hadoop (0.20.2) pseudo-distributed
Yes, please create a JIRA issue. Thanks! -- Best regards, Andrzej Bialecki

Re: [VOTE] Apache Nutch 1.2 Release Candidate #1

2010-09-10 Thread Andrzej Bialecki
On 2010-08-09 16:45, Julien Nioche wrote:
> I reopened https://issues.apache.org/jira/browse/NUTCH-870. It would be good to fix it before releasing 1.2
This is fixed. How about doing the release now? -- Best regards, Andrzej Bialecki

Re: [VOTE] Apache Nutch 1.2 Release Candidate #4

2010-09-24 Thread Andrzej Bialecki
tomorrow. -- Best regards, Andrzej Bialecki

Re: [VOTE] Apache Nutch 1.2 Release Candidate #4

2010-09-24 Thread Andrzej Bialecki
+ indexing to Solr went just fine. -- Best regards, Andrzej Bialecki

Re: Build failed in Hudson: Nutch-trunk #1280

2010-10-19 Thread Andrzej Bialecki
regards, Andrzej Bialecki

Re: ReviewBoard Instance

2010-10-26 Thread Andrzej Bialecki
... but what's the point of using the tool in our JIRA-based workflow? It looks to me like it duplicates at least part of JIRA's functionality, and the remaining part is what we do also in JIRA by convention... -- Best regards, Andrzej Bialecki

Re: Java.io.IOException with multiple copyField/ directives

2010-12-03 Thread Andrzej Bialecki
that the field type declared in your schema.xml is not multiValued. What was the exception? -- Best regards, Andrzej Bialecki
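For readers unfamiliar with the setup being discussed, a minimal sketch of a Solr schema.xml in which two copyField directives feed one destination. The field names and the `text` type are assumptions made for illustration, not taken from the poster's schema; the point is only that a destination receiving more than one value needs `multiValued="true"`.

```xml
<!-- Hypothetical schema.xml fragment; field names and the "text" type are illustrative -->
<fields>
  <field name="title"  type="text" indexed="true" stored="true"/>
  <field name="anchor" type="text" indexed="true" stored="false" multiValued="true"/>
  <!-- Destination receives values from two sources, so it is declared multiValued -->
  <field name="content_all" type="text" indexed="true" stored="false" multiValued="true"/>
</fields>

<copyField source="title"  dest="content_all"/>
<copyField source="anchor" dest="content_all"/>
```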

Re: Does Nutch 2.0 in good enough shape to test?

2010-12-17 Thread Andrzej Bialecki
> bug. - wait for Hadoop job completion in the Fetcher job
I missed your previous email... I'll fix this shortly - thanks for spotting it. -- Best regards, Andrzej Bialecki

Gora/HBase dependencies and deploy artifacts

2011-02-24 Thread Andrzej Bialecki
how to deploy Gora backend implementations so that they work with Nutch and with a generic unmodified Hadoop cluster. -- Best regards, Andrzej Bialecki

Re: [jira] Closed: (NUTCH-951) Backport changes from 2.0 into 1.3

2011-03-10 Thread Andrzej Bialecki
improvements from 2.0 have been backported into 1.3 now. The only remaining issue to address before rolling out a 1.3 release is NUTCH-914 "Implement Apache Project Branding Requirements" (and subtasks...) -- Best regards, Andrzej Bialecki

Re: Differences 1.x and trunk

2011-03-18 Thread Andrzej Bialecki
to specify what fields from WebPage you are interested in (and only these fields will be pulled in from the storage). This is all handled by StorageUtils methods. -- Best regards, Andrzej Bialecki

Re: [VOTE] Move 2.0 out of trunk

2011-09-20 Thread Andrzej Bialecki
for a usable platform, and continue redesign from that codebase. -- Best regards, Andrzej Bialecki

Re: [jira] [Commented] (NUTCH-797) parse-tika is not properly constructing URLs when the target begins with a ?

2011-10-12 Thread Andrzej Bialecki
, Andrzej Bialecki

Re: Nutch Maven artifacts now published as polled/nightly SNAPSHOTS

2011-11-05 Thread Andrzej Bialecki
, Andrzej Bialecki

Re: Persistent problems with Ivy dependencies in Eclipse

2011-11-10 Thread Andrzej Bialecki
porting of a pure ant build to an ant+ivy build. We should determine what deps are really needed by these plugins, and sanitize the ivy.xml files so that they make sense - if the existing files can't be untangled we can ditch them and come up with new, clean ones. -- Best regards, Andrzej
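As an illustration of what a "new, clean" per-plugin ivy.xml might look like, here is a minimal sketch. The module name and the single dependency are placeholders; the actual dependencies each plugin needs are exactly what the message says still has to be determined.

```xml
<!-- Hypothetical plugin ivy.xml; module name and the dependency are placeholders -->
<ivy-module version="2.0">
  <info organisation="org.apache.nutch" module="example-plugin"/>
  <configurations>
    <conf name="default"/>
  </configurations>
  <dependencies>
    <!-- List only what the plugin itself needs at runtime -->
    <dependency org="org.example" name="example-lib" rev="1.0" conf="default->default"/>
  </dependencies>
</ivy-module>
```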

Re: Signature == null ?

2011-11-15 Thread Andrzej Bialecki
in CrawlDbReducer... Do you notice any pattern to these pages? What's their origin? -- Best regards, Andrzej Bialecki

Re: Dependency Injection

2011-11-22 Thread Andrzej Bialecki
supposed to run ... so at that time we didn't think this complication was justified. If we can figure out something between full-blown OSGI and the current system then that would be great. -- Best regards, Andrzej Bialecki

Re: Dependency Injection

2011-11-23 Thread Andrzej Bialecki
On 23/11/2011 01:02, Andrzej Bialecki wrote:
> On 22/11/2011 19:47, PJ Herring wrote:
>> Hey Chris, Thanks for the response. I looked at the documents you sent me, and I really do think incorporating some kind of DI Framework could be a great addition to Nutch. I have a general plan of attack

Re: Upgrading to Hadoop 0.22.0+

2011-12-13 Thread Andrzej Bialecki
drop in the 0.22 jars and see if it compiles / tests are passing. -- Best regards, Andrzej Bialecki

Re: Upgrading to Hadoop 0.22.0+

2011-12-13 Thread Andrzej Bialecki
regards, Andrzej Bialecki

Re: [jira] [Created] (NUTCH-1225) Migrate CrawlDBScanner to MapReduce API

2011-12-14 Thread Andrzej Bialecki
is in org.apache.hadoop.mapreduce.lib.output. -- Best regards, Andrzej Bialecki

Re: [jira] [Created] (NUTCH-1225) Migrate CrawlDBScanner to MapReduce API

2011-12-14 Thread Andrzej Bialecki
org.apache.hadoop.mapred.MapFileOutputFormat still uses the old API, and it's not deprecated yet. -- Best regards, Andrzej Bialecki

Re: [jira] [Created] (NUTCH-1225) Migrate CrawlDBScanner to MapReduce API

2011-12-15 Thread Andrzej Bialecki
the other ones are easy to convert, too... I'm bogged with other work now, but I'll see if I can prepare an example later today... -- Best regards, Andrzej Bialecki

Re: Build failed in Jenkins: Nutch-trunk #1706

2011-12-28 Thread Andrzej Bialecki
/>
+ <dependency org="junit" name="junit" rev="3.8.1" conf="*->default"/>
<dependency org="org.apache.hadoop" name="hadoop-test" rev="0.20.205.0" conf="test->default"/>
-- Best regards, Andrzej Bialecki

Re: Build failed in Jenkins: Nutch-trunk #1706

2011-12-28 Thread Andrzej Bialecki
on the classpath and masks the issue? this happened to me once or twice... -- Best regards, Andrzej Bialecki

Re: Drawing an analogy between AdaptiveFetchSchedule and AdaptiveCrawlDelay

2012-03-02 Thread Andrzej Bialecki
> shine some light on what happened to Fetcher2.java that Dogacan refers to? I was only ever accustomed to OldFetcher and Fetcher :0)
Fetcher2 is the current Fetcher. The original Fetcher was temporarily renamed OldFetcher and then removed. -- Best regards, Andrzej Bialecki

Re: question about ObjectCache

2012-04-10 Thread Andrzej Bialecki
. -- Best regards, Andrzej Bialecki

[jira] Commented: (NUTCH-650) Hbase Integration

2010-06-29 Thread Andrzej Bialecki (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-650?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12883559#action_12883559 ] Andrzej Bialecki commented on NUTCH-650: - So far as one can digest such a giant

[jira] Assigned: (NUTCH-837) Remove search servers and Lucene dependencies

2010-07-01 Thread Andrzej Bialecki (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-837?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrzej Bialecki reassigned NUTCH-837: --- Assignee: Andrzej Bialecki Remove search servers and Lucene dependencies

[jira] Updated: (NUTCH-837) Remove search servers and Lucene dependencies

2010-07-02 Thread Andrzej Bialecki (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-837?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrzej Bialecki updated NUTCH-837: Attachment: NUTCH-837.patch Updated patch against r959954 (after NUTCH-836). Remove

[jira] Updated: (NUTCH-837) Remove search servers and Lucene dependencies

2010-07-02 Thread Andrzej Bialecki (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-837?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrzej Bialecki updated NUTCH-837: Attachment: (was: NUTCH-837.patch) Remove search servers and Lucene dependencies

[jira] Commented: (NUTCH-837) Remove search servers and Lucene dependencies

2010-07-02 Thread Andrzej Bialecki (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-837?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12884729#action_12884729 ] Andrzej Bialecki commented on NUTCH-837: - bq. So, I think we should still have

[jira] Resolved: (NUTCH-837) Remove search servers and Lucene dependencies

2010-07-02 Thread Andrzej Bialecki (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-837?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrzej Bialecki resolved NUTCH-837. - Resolution: Fixed Committed in r960064. Thanks for review! Remove search servers

[jira] Commented: (NUTCH-821) Use ivy in nutch builds

2010-07-05 Thread Andrzej Bialecki (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-821?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12885188#action_12885188 ] Andrzej Bialecki commented on NUTCH-821: - I think this patch refers to some parts

[jira] Updated: (NUTCH-696) Timeout for Parser

2010-07-05 Thread Andrzej Bialecki (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-696?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrzej Bialecki updated NUTCH-696: Attachment: timeout.patch A simple patch that implements the strategy outlined here http

[jira] Commented: (NUTCH-696) Timeout for Parser

2010-07-05 Thread Andrzej Bialecki (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-696?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12885257#action_12885257 ] Andrzej Bialecki commented on NUTCH-696: - Yes - this patch is a quick solution

[jira] Reopened: (NUTCH-696) Timeout for Parser

2010-07-05 Thread Andrzej Bialecki (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-696?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrzej Bialecki reopened NUTCH-696: - This may be useful after all - let's gather more comments. Timeout for Parser

[jira] Commented: (NUTCH-696) Timeout for Parser

2010-07-05 Thread Andrzej Bialecki (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-696?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12885295#action_12885295 ] Andrzej Bialecki commented on NUTCH-696: - I agree, ultimately that's the way to go

[jira] Commented: (NUTCH-821) Use ivy in nutch builds

2010-07-06 Thread Andrzej Bialecki (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-821?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12885583#action_12885583 ] Andrzej Bialecki commented on NUTCH-821: - +1 for this patch for now - all good

[jira] Created: (NUTCH-843) Separate the build and runtime environments

2010-07-07 Thread Andrzej Bialecki (JIRA)
Versions: 2.0 Reporter: Andrzej Bialecki Assignee: Andrzej Bialecki Currently there is no clean separation of source, build and runtime artifacts. On one hand, it makes it easier to get started in local mode, but on the other hand it makes the distributed (or pseudo

[jira] Updated: (NUTCH-843) Separate the build and runtime environments

2010-07-07 Thread Andrzej Bialecki (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-843?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrzej Bialecki updated NUTCH-843: Attachment: NUTCH-843.patch This patch moves bin/nutch to src/bin/nutch, and creates

[jira] Commented: (NUTCH-843) Separate the build and runtime environments

2010-07-07 Thread Andrzej Bialecki (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-843?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12886015#action_12886015 ] Andrzej Bialecki commented on NUTCH-843: - We need to create the job file anyway

[jira] Updated: (NUTCH-843) Separate the build and runtime environments

2010-07-07 Thread Andrzej Bialecki (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-843?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrzej Bialecki updated NUTCH-843: Attachment: NUTCH-843.patch Updated patch that moves nutch.jar to lib/ for the local

[jira] Updated: (NUTCH-844) Improve NutchConfiguration

2010-07-08 Thread Andrzej Bialecki (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-844?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrzej Bialecki updated NUTCH-844: Attachment: conf.patch Improve NutchConfiguration

[jira] Commented: (NUTCH-843) Separate the build and runtime environments

2010-07-08 Thread Andrzej Bialecki (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-843?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12886318#action_12886318 ] Andrzej Bialecki commented on NUTCH-843: - runtime/local doesn't need Hadoop scripts

[jira] Commented: (NUTCH-843) Separate the build and runtime environments

2010-07-08 Thread Andrzej Bialecki (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-843?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12886330#action_12886330 ] Andrzej Bialecki commented on NUTCH-843: - Pseudo-distributed (i.e. a real

[jira] Resolved: (NUTCH-845) Native hadoop libs not available through maven

2010-07-08 Thread Andrzej Bialecki (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-845?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrzej Bialecki resolved NUTCH-845. - Fix Version/s: 2.0 Resolution: Fixed Committed in rev. 961778. Thanks for review

[jira] Updated: (NUTCH-844) Improve NutchConfiguration

2010-07-14 Thread Andrzej Bialecki (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-844?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrzej Bialecki updated NUTCH-844: Attachment: NUTCH-844.patch Updated patch. This also addresses an issue in PluginRepository

[jira] Resolved: (NUTCH-844) Improve NutchConfiguration

2010-07-14 Thread Andrzej Bialecki (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-844?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrzej Bialecki resolved NUTCH-844. - Resolution: Fixed Committed in r964063. Thanks for review! Improve NutchConfiguration

[jira] Updated: (NUTCH-858) No longer able to set per-field boosts on lucene documents

2010-07-21 Thread Andrzej Bialecki (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-858?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrzej Bialecki updated NUTCH-858: Assignee: Andrzej Bialecki Fix Version/s: 1.2 No longer able to set per-field

[jira] Commented: (NUTCH-858) No longer able to set per-field boosts on lucene documents

2010-07-21 Thread Andrzej Bialecki (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-858?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12890873#action_12890873 ] Andrzej Bialecki commented on NUTCH-858: - Unfortunately no. The patch was included

[jira] Resolved: (NUTCH-863) Benchmark and a testbed proxy server

2010-07-30 Thread Andrzej Bialecki (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-863?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrzej Bialecki resolved NUTCH-863. - Fix Version/s: 2.0 Resolution: Fixed Committed in rev. 980932. Benchmark

[jira] Created: (NUTCH-867) Port Nutch benchmark to Nutchbase

2010-07-31 Thread Andrzej Bialecki (JIRA)
Bialecki Assignee: Andrzej Bialecki Fix For: nutchbase Bring tools from NUTCH-863 to Nutchbase, and measure the performance of the Nutchbase branch vs. trunk. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue

[jira] Commented: (NUTCH-858) No longer able to set per-field boosts on lucene documents

2010-08-04 Thread Andrzej Bialecki (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-858?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12895377#action_12895377 ] Andrzej Bialecki commented on NUTCH-858: - It was r960064, but I have to admit I

[jira] Updated: (NUTCH-867) Port Nutch benchmark to Nutchbase

2010-08-04 Thread Andrzej Bialecki (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-867?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrzej Bialecki updated NUTCH-867: Attachment: benchmark.patch Ported benchmark that uses HSQLDB as the store impl

[jira] Updated: (NUTCH-876) Remove remaining robots/IP blocking code in lib-http

2010-08-09 Thread Andrzej Bialecki (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-876?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrzej Bialecki updated NUTCH-876: Attachment: NUTCH-876.patch Patch to fix the issue. If there are no objections I'll commit

[jira] Created: (NUTCH-879) URL-s getting lost

2010-08-10 Thread Andrzej Bialecki (JIRA)
+ HDFS * trunk r983472, using MySQL store * branch-1.3 Reporter: Andrzej Bialecki I ran the Benchmark using branch-1.3 and trunk (formerly nutchbase). With the same Benchmark parameters and the same plugins branch-1.3 collects ~1.5mln urls, while trunk collects ~20,000 urls. Clearly

[jira] Updated: (NUTCH-880) REST API (and webapp) for Nutch

2010-08-11 Thread Andrzej Bialecki (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-880?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrzej Bialecki updated NUTCH-880: Description: This issue is for discussing a REST-style API for accessing Nutch. Here's

[jira] Created: (NUTCH-884) FetcherJob should run more reduce tasks than default

2010-08-11 Thread Andrzej Bialecki (JIRA)
: fetcher Affects Versions: 2.0 Reporter: Andrzej Bialecki Assignee: Andrzej Bialecki Fix For: 2.0 FetcherJob now performs fetching in the reduce phase. This means that in a typical Hadoop setup there will be many fewer reduce tasks than map tasks
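Until a fix like the one proposed lands, one generic way to raise the reduce-task count by hand is the standard Hadoop 0.20 `mapred.reduce.tasks` property. This is only a sketch: the value 20 is arbitrary, and the snippet does not say whether FetcherJob honours this setting or will compute its own number.

```xml
<!-- Sketch of a site-level override; the value is an arbitrary example -->
<configuration>
  <property>
    <name>mapred.reduce.tasks</name>
    <value>20</value>
    <description>Default number of reduce tasks per job.</description>
  </property>
</configuration>
```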

[jira] Resolved: (NUTCH-872) Change the default fetcher.parse to FALSE

2010-08-11 Thread Andrzej Bialecki (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-872?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrzej Bialecki resolved NUTCH-872. - Fix Version/s: 2.0 Resolution: Fixed I changed the name of the option to -parse
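For context, `fetcher.parse` is an ordinary Nutch configuration property, so the default discussed in this issue can also be set explicitly in a site override. A minimal sketch, assuming the usual conf/nutch-site.xml location:

```xml
<!-- Sketch of a conf/nutch-site.xml override; the file location is assumed -->
<configuration>
  <property>
    <name>fetcher.parse</name>
    <value>false</value>
    <description>If true, the fetcher parses content while fetching.</description>
  </property>
</configuration>
```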

[jira] Updated: (NUTCH-884) FetcherJob should run more reduce tasks than default

2010-08-11 Thread Andrzej Bialecki (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-884?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrzej Bialecki updated NUTCH-884: Attachment: NUTCH-884.patch Patch with the change. I also rearranged the arguments

[jira] Commented: (NUTCH-882) Design a Host table in GORA

2010-08-18 Thread Andrzej Bialecki (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-882?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12899810#action_12899810 ] Andrzej Bialecki commented on NUTCH-882: - This functionality is very useful

[jira] Created: (NUTCH-891) Nutch build should not depend on unversioned local deps

2010-08-19 Thread Andrzej Bialecki (JIRA)
Nutch build should not depend on unversioned local deps --- Key: NUTCH-891 URL: https://issues.apache.org/jira/browse/NUTCH-891 Project: Nutch Issue Type: Bug Reporter: Andrzej

[jira] Commented: (NUTCH-891) Nutch build should not depend on unversioned local deps

2010-08-19 Thread Andrzej Bialecki (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-891?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12900455#action_12900455 ] Andrzej Bialecki commented on NUTCH-891: - Yes, this would help. Nutch build

[jira] Created: (NUTCH-893) DataStore.put() silently loses records when executed from multiple processes

2010-08-25 Thread Andrzej Bialecki (JIRA)
Issue Type: Bug Environment: Gora HEAD, SqlStore, MySQL 5.1, Ubuntu 10.4 x64, Sun JDK 1.6 Reporter: Andrzej Bialecki In order to debug the issue described in NUTCH-879 I created a test to simulate multiple clients appending to webtable (please see the patch), which

[jira] Updated: (NUTCH-893) DataStore.put() silently loses records when executed from multiple processes

2010-08-25 Thread Andrzej Bialecki (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-893?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrzej Bialecki updated NUTCH-893: Attachment: NUTCH-893.patch Unit test to illustrate the issue. DataStore.put() silently

[jira] Commented: (NUTCH-893) DataStore.put() silently loses records when executed from multiple processes

2010-08-30 Thread Andrzej Bialecki (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-893?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12904226#action_12904226 ] Andrzej Bialecki commented on NUTCH-893: - Dogacan, flush() doesn't help

[jira] Commented: (NUTCH-893) DataStore.put() silently loses records when executed from multiple processes

2010-09-08 Thread Andrzej Bialecki (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-893?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12907297#action_12907297 ] Andrzej Bialecki commented on NUTCH-893: - Very good catch - yes, the test now

[jira] Commented: (NUTCH-893) DataStore.put() silently loses records when executed from multiple processes

2010-09-13 Thread Andrzej Bialecki (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-893?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12908791#action_12908791 ] Andrzej Bialecki commented on NUTCH-893: - +1 and +1. DataStore.put() silently

[jira] Created: (NUTCH-907) DataStore API doesn't support multiple storage areas for multiple disjoint crawls

2010-09-15 Thread Andrzej Bialecki (JIRA)
Issue Type: Bug Reporter: Andrzej Bialecki Fix For: 2.0 In Nutch 1.x it was possible to easily select a set of crawl data (crawldb, page data, linkdb, etc) by specifying a path where the data was stored. This enabled users to run several disjoint crawls

[jira] Commented: (NUTCH-882) Design a Host table in GORA

2010-09-15 Thread Andrzej Bialecki (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-882?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12909757#action_12909757 ] Andrzej Bialecki commented on NUTCH-882: - +1 to NutchContext. See also NUTCH-907

[jira] Commented: (NUTCH-907) DataStore API doesn't support multiple storage areas for multiple disjoint crawls

2010-09-16 Thread Andrzej Bialecki (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-907?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12910109#action_12910109 ] Andrzej Bialecki commented on NUTCH-907: - That's very good news - in that case I'm

[jira] Updated: (NUTCH-880) REST API (and webapp) for Nutch

2010-09-16 Thread Andrzej Bialecki (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-880?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrzej Bialecki updated NUTCH-880: Attachment: API.patch Initial patch for discussion. This is a work in progress, so only

[jira] Assigned: (NUTCH-862) HttpClient null pointer exception

2010-09-17 Thread Andrzej Bialecki (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-862?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrzej Bialecki reassigned NUTCH-862: --- Assignee: Andrzej Bialecki HttpClient null pointer exception

[jira] Resolved: (NUTCH-906) Nutch OpenSearch sometimes raises DOMExceptions due to Lucene column names not being valid XML tag names

2010-09-17 Thread Andrzej Bialecki (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-906?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrzej Bialecki resolved NUTCH-906. - Fix Version/s: 1.2 Resolution: Fixed Fixed in rev. 998261. Thanks! Nutch

[jira] Commented: (NUTCH-909) Add alternative search-provider to Nutch site

2010-09-20 Thread Andrzej Bialecki (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-909?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12912474#action_12912474 ] Andrzej Bialecki commented on NUTCH-909: - bq. It might be better to see the message

[jira] Commented: (NUTCH-880) REST API (and webapp) for Nutch

2010-09-21 Thread Andrzej Bialecki (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-880?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12913118#action_12913118 ] Andrzej Bialecki commented on NUTCH-880: - bq. I think we can combine the approach

[jira] Commented: (NUTCH-907) DataStore API doesn't support multiple storage areas for multiple disjoint crawls

2010-10-01 Thread Andrzej Bialecki (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-907?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12916870#action_12916870 ] Andrzej Bialecki commented on NUTCH-907: - Hi Sertan, Thanks for the patch

[jira] Commented: (NUTCH-882) Design a Host table in GORA

2010-10-01 Thread Andrzej Bialecki (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-882?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12916874#action_12916874 ] Andrzej Bialecki commented on NUTCH-882: - Doğacan, I missed your previous comment

[jira] Commented: (NUTCH-864) Fetcher generates entries with status 0

2010-10-01 Thread Andrzej Bialecki (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-864?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12916912#action_12916912 ] Andrzej Bialecki commented on NUTCH-864: - I think the difficulty comes from

[jira] Commented: (NUTCH-913) Nutch should use new namespace for Gora

2010-10-13 Thread Andrzej Bialecki (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-913?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12920610#action_12920610 ] Andrzej Bialecki commented on NUTCH-913: - There are formatting issues

[jira] Updated: (NUTCH-921) Reduce dependency of Nutch on config files

2010-10-19 Thread Andrzej Bialecki (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-921?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrzej Bialecki updated NUTCH-921: Attachment: NUTCH-921.patch Patch that implements reading config parameters from
