[Nutch-dev] [jira] Commented: (NUTCH-532) CrawlDbMerger: wrong computation of last fetch time

2007-08-06 Thread Andrzej Bialecki (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-532?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12517913 ] Andrzej Bialecki commented on NUTCH-532: - The compatibility code in CrawlDatum is misplaced, I think

[Nutch-dev] [jira] Commented: (NUTCH-535) ParseData's contentMeta accumulates unnecessary values during parse

2007-08-03 Thread Andrzej Bialecki (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-535?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12517580 ] Andrzej Bialecki commented on NUTCH-535: - I also noticed (in a job unrelated to Nutch) that Hadoop sometimes

[Nutch-dev] [jira] Commented: (NUTCH-532) CrawlDbMerger: wrong computation of last fetch time

2007-08-02 Thread Andrzej Bialecki (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-532?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12517402 ] Andrzej Bialecki commented on NUTCH-532: - Float values were originally intended to express fractions

[Nutch-dev] [jira] Commented: (NUTCH-533) LinkDbMerger: url normlaized is not updated in the key and inlinks list

2007-07-30 Thread Andrzej Bialecki (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-533?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12516408 ] Andrzej Bialecki commented on NUTCH-533: - +1. Please fix the typo (present also in the original file): empy

[Nutch-dev] [jira] Commented: (NUTCH-514) Indexer should only index pages with fetch status SUCCESS

2007-07-30 Thread Andrzej Bialecki (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-514?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12516428 ] Andrzej Bialecki commented on NUTCH-514: - +1 we're only humans with 24 hours in a day .. ;) Actually

[Nutch-dev] [jira] Commented: (NUTCH-439) Top Level Domains Indexing / Scoring

2007-07-26 Thread Andrzej Bialecki (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-439?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12515812 ] Andrzej Bialecki commented on NUTCH-439: - Some minor issues: * TLDScoringFilter contains a misspelled field

[Nutch-dev] [jira] Commented: (NUTCH-525) DeleteDuplicates generates ArrayIndexOutOfBoundsException when trying to rerun dedup on a segment

2007-07-24 Thread Andrzej Bialecki (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-525?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12514914 ] Andrzej Bialecki commented on NUTCH-525: - +1 for adding undeleteAll(). When DDRecordReader was created

[Nutch-dev] [jira] Commented: (NUTCH-518) Fix OpicScoringFilter to respect scoring filter chaining

2007-07-19 Thread Andrzej Bialecki (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-518?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12513853 ] Andrzej Bialecki commented on NUTCH-518: - IMHO this change is not helpful. It takes away too much control

Re: [Nutch-dev] Looking to fix relative path issue in linkdb

2007-07-19 Thread Andrzej Bialecki
trigger a costly copy operation. -- Best regards, Andrzej Bialecki ___. ___ ___ ___ _ _ __ [__ || __|__/|__||\/| Information Retrieval, Semantic Web ___|||__|| \| || | Embedded Unix, System Integration http://www.sigram.com Contact: info at sigram dot

[Nutch-dev] [jira] Reopened: (NUTCH-518) Fix OpicScoringFilter to respect scoring filter chaining

2007-07-18 Thread Andrzej Bialecki (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-518?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrzej Bialecki reopened NUTCH-518: - This one was too quick, I think ... I wanted to discuss the issue whether the chaining

[Nutch-dev] [jira] Commented: (NUTCH-518) Fix OpicScoringFilter to respect scoring filter chaining

2007-07-18 Thread Andrzej Bialecki (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-518?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12513704 ] Andrzej Bialecki commented on NUTCH-518: - Right, I was too quick too ... ;) Leave it in for now. Let's agree

[Nutch-dev] [jira] Commented: (NUTCH-516) Next fetch time is not set when it is a CrawlDatum.STATUS_FETCH_GONE

2007-07-17 Thread Andrzej Bialecki (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-516?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12513258 ] Andrzej Bialecki commented on NUTCH-516: - setPageGoneSchedule method was specifically added to handle

[Nutch-dev] [jira] Commented: (NUTCH-515) Next fetch time is set incorrectly

2007-07-16 Thread Andrzej Bialecki (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-515?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12513019 ] Andrzej Bialecki commented on NUTCH-515: - +1 - sorry for the mess up ... Next fetch time is set

[Nutch-dev] [jira] Commented: (NUTCH-505) Outlink urls should be validated

2007-07-12 Thread Andrzej Bialecki (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-505?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12512139 ] Andrzej Bialecki commented on NUTCH-505: - Please test Java 1.5 and Java 1.6 - IIRC there are some

[Nutch-dev] [jira] Closed: (NUTCH-511) Recrawling

2007-07-12 Thread Andrzej Bialecki (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-511?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrzej Bialecki closed NUTCH-511. --- Resolution: Invalid Assignee: Andrzej Bialecki Please use mailing lists

[Nutch-dev] [jira] Closed: (NUTCH-512) Search on date range

2007-07-12 Thread Andrzej Bialecki (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-512?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrzej Bialecki closed NUTCH-512. --- Resolution: Invalid Please use mailing lists for such questions. Search on date range

[Nutch-dev] [jira] Commented: (NUTCH-439) Top Level Domains Indexing / Scoring

2007-07-10 Thread Andrzej Bialecki (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-439?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12511362 ] Andrzej Bialecki commented on NUTCH-439: - Very nice patch! A couple comments: * the fix

Re: [Nutch-dev] Fwd: [Collex] application#index (ActionController::RoutingError) no route found to match \/nines/ escape(document.title) u, \ with {:method=:get}

2007-07-10 Thread Andrzej Bialecki
and tries to build absolute URLs out of them. In many cases the strings have nothing to do with URLs. Please see NUTCH-505 for more details. -- Best regards, Andrzej Bialecki

[Nutch-dev] [jira] Commented: (NUTCH-505) Outlink urls should be validated

2007-07-10 Thread Andrzej Bialecki (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-505?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12511447 ] Andrzej Bialecki commented on NUTCH-505: - * In ParseOutputFormat, the calculation of outlinksToStore should

Re: [Nutch-dev] OPIC scoring differences

2007-07-09 Thread Andrzej Bialecki
://wiki.apache.org/nutch/FixingOpicScoring bottom of the page). -- Best regards, Andrzej Bialecki

Re: [Nutch-dev] Not renewing CrawlDatum on Inject

2007-07-09 Thread Andrzej Bialecki
in Injector.InjectReducer. -- Best regards, Andrzej Bialecki

Re: [Nutch-dev] Plans on releasing another bug fix release?

2007-07-04 Thread Andrzej Bialecki
not be as full of features as one might have wished, but that's what the 1.0 release implies - it's usable, with some limitations. Will it be soon? :) I'm pretty sure it will be some time after the vacation period is over, not earlier ;) -- Best regards, Andrzej Bialecki

[Nutch-dev] [jira] Commented: (NUTCH-392) OutputFormat implementations should pass on Progressable

2007-06-28 Thread Andrzej Bialecki (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-392?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12508816 ] Andrzej Bialecki commented on NUTCH-392: - Re: Content versioning - we can use negative int values as version

[Nutch-dev] [jira] Commented: (NUTCH-501) Implement a different caching mechanism for objects cached in configuration

2007-06-23 Thread Andrzej Bialecki (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-501?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12507609 ] Andrzej Bialecki commented on NUTCH-501: - +1 - looks good. An idea: perhaps we could add a LOG.debug

[Nutch-dev] [jira] Commented: (NUTCH-504) NUTCH-443 broke parsing during fetching

2007-06-22 Thread Andrzej Bialecki (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-504?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12507168 ] Andrzej Bialecki commented on NUTCH-504: - +1 - we should skip documents that failed to parse properly

[Nutch-dev] [jira] Commented: (NUTCH-497) Extreme Nested Tags causes StackOverflowException in DomContentUtils...Spider Trap

2007-06-21 Thread Andrzej Bialecki (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-497?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12506775 ] Andrzej Bialecki commented on NUTCH-497: - The patch looks good to me as it is now - however, I've seen

Re: [Nutch-dev] upgrade to hadoop-0.13?

2007-06-18 Thread Andrzej Bialecki
HADOOP-1343 for more details. This change will affect a lot of places in our code, so it would be best to do it long before the next Nutch release. -- Best regards, Andrzej Bialecki

[Nutch-dev] [jira] Commented: (NUTCH-501) implementing a different caching mechanism for objects

2007-06-18 Thread Andrzej Bialecki (JIRA)
.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12505807 ] Andrzej Bialecki commented on NUTCH-501: - ObjectCache should support caching objects that fall under the same key, but are differently configured. This situation occurs when running in local

[Nutch-dev] [jira] Commented: (NUTCH-485) Change HtmlParseFilter 's to return ParseResult object instead of Parse object

2007-06-17 Thread Andrzej Bialecki (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-485?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12505598 ] Andrzej Bialecki commented on NUTCH-485: - Whitespace changes should be committed as a separate patch

[Nutch-dev] [jira] Commented: (NUTCH-498) Use Combiner in LinkDb to increase speed of linkdb generation

2007-06-15 Thread Andrzej Bialecki (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-498?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12505302 ] Andrzej Bialecki commented on NUTCH-498: - Currently there is no difference, indeed. The version

[Nutch-dev] [jira] Commented: (NUTCH-392) OutputFormat implementations should pass on Progressable

2007-06-02 Thread Andrzej Bialecki (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-392?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12500951 ] Andrzej Bialecki commented on NUTCH-392: - I don't think it's a good idea, it's creating too many cryptic

[Nutch-dev] [jira] Commented: (NUTCH-392) OutputFormat implementations should pass on Progressable

2007-06-01 Thread Andrzej Bialecki (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-392?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12500635 ] Andrzej Bialecki commented on NUTCH-392: - Good point. We can change it to use the following pattern

[Nutch-dev] [jira] Commented: (NUTCH-392) OutputFormat implementations should pass on Progressable

2007-06-01 Thread Andrzej Bialecki (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-392?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12500728 ] Andrzej Bialecki commented on NUTCH-392: - I think it is okay to allow BLOCK compression for linkdb, crawldb

Re: [Nutch-dev] Plugins and Thread Safety

2007-06-01 Thread Andrzej Bialecki
regards, Andrzej Bialecki

Re: [Nutch-dev] Plugins and Thread Safety

2007-06-01 Thread Andrzej Bialecki
Briggs wrote: Oh, you want me to change the getSorted method to be synchronized? I'll put a lock in there and see what happens, if that is what you are referring to. Yes, please try this change. -- Best regards, Andrzej Bialecki

Re: [Nutch-dev] [PATCH] Moving HitDetails construction to a HitDetails constructor (v2).

2007-06-01 Thread Andrzej Bialecki
, in my opinion it should not be applied. -- Best regards, Andrzej Bialecki

Re: [Nutch-dev] [jira] Resolved: (NUTCH-61) Adaptive re-fetch interval. Detecting umodified content

2007-05-31 Thread Andrzej Bialecki
rubdabadub wrote: MANY MANY Super Thanks! I can't thank you enough for this Patch :-) This is so cool!!! You're welcome :) I would appreciate it if you could give it some testing and provide feedback ... -- Best regards, Andrzej Bialecki

[Nutch-dev] [jira] Updated: (NUTCH-466) Flexible segment format

2007-05-31 Thread Andrzej Bialecki (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-466?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrzej Bialecki updated NUTCH-466: Attachment: segmentparts.patch This patch contains the following modifications

[Nutch-dev] [jira] Resolved: (NUTCH-486) Break searcher dependency on commons-cli

2007-05-31 Thread Andrzej Bialecki (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-486?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrzej Bialecki resolved NUTCH-486. - Resolution: Won't Fix Fix Version/s: 1.0.0 Assignee: Andrzej Bialecki

[Nutch-dev] [jira] Resolved: (NUTCH-392) OutputFormat implementations should pass on Progressable

2007-05-31 Thread Andrzej Bialecki (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-392?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrzej Bialecki resolved NUTCH-392. - Resolution: Fixed Fix Version/s: 1.0.0 Assignee: Andrzej Bialecki

Re: [Nutch-dev] Plugins initialized all the time!

2007-05-30 Thread Andrzej Bialecki
tasks). In other words, it seems to me that there is no such situation in which we have to reload plugins within the same JVM, but with different parameters. -- Best regards, Andrzej Bialecki

[Nutch-dev] [jira] Resolved: (NUTCH-61) Adaptive re-fetch interval. Detecting umodified content

2007-05-30 Thread Andrzej Bialecki (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-61?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrzej Bialecki resolved NUTCH-61. Resolution: Fixed Fix Version/s: 1.0.0 Applied with some modifications in rev. 542903

[Nutch-dev] [jira] Work started: (NUTCH-466) Flexible segment format

2007-05-28 Thread Andrzej Bialecki (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-466?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Work on NUTCH-466 started by Andrzej Bialecki . Flexible segment format --- Key: NUTCH-466 URL: https

Re: [Nutch-dev] Get meta name=description and other meta tags from Content

2007-05-23 Thread Andrzej Bialecki
you need to use HtmlParseFilter. -- Best regards, Andrzej Bialecki

Re: [Nutch-dev] Issues pending before 0.9 release

2007-05-18 Thread Andrzej Bialecki
rubdabadub wrote: On 3/22/07, Andrzej Bialecki [EMAIL PROTECTED] wrote: rubdabadub wrote: Hi: Just wondering about NUTCH-61 http://issues.apache.org/jira/browse/Nutch-61 Will it make the 0.9 cut? It would be nice if it did. It's probably too late. This was discussed before

[Nutch-dev] [jira] Commented: (NUTCH-486) Break searcher dependency on commons-cli

2007-05-15 Thread Andrzej Bialecki (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-486?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12495867 ] Andrzej Bialecki commented on NUTCH-486: - -1 This is a side-effect of using LinkDbReader to read the LinkDb

[Nutch-dev] [jira] Updated: (NUTCH-443) allow parsers to return multiple Parse object, this will speed up the rss parser

2007-05-14 Thread Andrzej Bialecki (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-443?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrzej Bialecki updated NUTCH-443: Attachment: patch.txt I'm not too happy with the direction you took in the latest patch

[Nutch-dev] [jira] Commented: (NUTCH-443) allow parsers to return multiple Parse object, this will speed up the rss parser

2007-05-14 Thread Andrzej Bialecki (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-443?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12495797 ] Andrzej Bialecki commented on NUTCH-443: - Indeed... I forgot that we need crawl_parse to collect new sub

[Nutch-dev] [jira] Commented: (NUTCH-485) Change HtmlParseFilter 's to return ParseResult object instead of Parse object

2007-05-12 Thread Andrzej Bialecki (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-485?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12495319 ] Andrzej Bialecki commented on NUTCH-485: - I think a more natural change would be this: ParseResult filter

[Nutch-dev] [jira] Assigned: (NUTCH-443) allow parsers to return multiple Parse object, this will speed up the rss parser

2007-05-09 Thread Andrzej Bialecki (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-443?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrzej Bialecki reassigned NUTCH-443: --- Assignee: Andrzej Bialecki (was: Chris A. Mattmann) allow parsers to return

[Nutch-dev] [jira] Resolved: (NUTCH-443) allow parsers to return multiple Parse object, this will speed up the rss parser

2007-05-09 Thread Andrzej Bialecki (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-443?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrzej Bialecki resolved NUTCH-443. - Resolution: Fixed Committed in rev. 536606. Big thanks to all who contributed

[Nutch-dev] [jira] Resolved: (NUTCH-467) DeleteDuplicate fails if Segment index directory has 0 documents

2007-05-09 Thread Andrzej Bialecki (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-467?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrzej Bialecki resolved NUTCH-467. - Resolution: Fixed Assignee: Andrzej Bialecki Patch applied in rev. 532105

[Nutch-dev] [jira] Closed: (NUTCH-418) Fixes parsing of XHTML (e.g. title)

2007-05-09 Thread Andrzej Bialecki (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-418?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrzej Bialecki closed NUTCH-418. --- Resolution: Fixed Fix Version/s: 0.9.0 Already applied. Fixes parsing of XHTML (e.g

[Nutch-dev] [jira] Closed: (NUTCH-417) After upgrade to hadoop-0.9.1, parsing and indexing doesn't work.

2007-05-09 Thread Andrzej Bialecki (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-417?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrzej Bialecki closed NUTCH-417. --- Resolution: Fixed Fix Version/s: 0.9.0 Assignee: Andrzej Bialecki Fixed

[Nutch-dev] [jira] Commented: (NUTCH-393) Indexer doesn't handle null documents returned by filters

2007-05-09 Thread Andrzej Bialecki (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-393?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12494552 ] Andrzej Bialecki commented on NUTCH-393: - I agree with that - either all filters should run or the document

Re: [Nutch-dev] svn commit: r536606 - in /lucene/nutch/trunk: ./ src/java/org/apache/nutch/fetcher/ src/java/org/apache/nutch/metadata/ src/java/org/apache/nutch/parse/ src/java/org/apache/nutch/util/

2007-05-09 Thread Andrzej Bialecki
? Indeed. Thanks for spotting this - it's fixed. -- Best regards, Andrzej Bialecki

[Nutch-dev] [jira] Resolved: (NUTCH-393) Indexer doesn't handle null documents returned by filters

2007-05-09 Thread Andrzej Bialecki (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-393?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrzej Bialecki resolved NUTCH-393. - Resolution: Fixed Fix Version/s: 1.0.0 Assignee: Andrzej Bialecki Both

[Nutch-dev] [jira] Commented: (NUTCH-479) Support for OR queries

2007-05-09 Thread Andrzej Bialecki (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-479?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12494582 ] Andrzej Bialecki commented on NUTCH-479: - Correct - the only syntax element added in this patch

[Nutch-dev] [jira] Created: (NUTCH-479) Support for OR queries

2007-05-07 Thread Andrzej Bialecki (JIRA)
: Andrzej Bialecki Assigned To: Andrzej Bialecki Fix For: 1.0.0 There have been many requests from users to extend Nutch query syntax to add support for OR queries, in addition to the implicit AND and NOT queries supported now. -- This message is automatically generated by JIRA

[Nutch-dev] [jira] Updated: (NUTCH-479) Support for OR queries

2007-05-07 Thread Andrzej Bialecki (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-479?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrzej Bialecki updated NUTCH-479: Attachment: or.patch Patch based on the discussion on the mailing list, and a description

Re: [Nutch-dev] SIGSEGV

2007-05-06 Thread Andrzej Bialecki
multithreaded apps linked to libc_r or libpthread. -- Best regards, Andrzej Bialecki

[Nutch-dev] [jira] Updated: (NUTCH-477) Extend URLFilters to support different filtering chains

2007-05-03 Thread Andrzej Bialecki (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-477?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrzej Bialecki updated NUTCH-477: Attachment: urlfilters.patch This patch implements suggested changes. Extend URLFilters

[Nutch-dev] [jira] Commented: (NUTCH-468) Scoring filter should distribute score to all outlinks at once

2007-04-27 Thread Andrzej Bialecki (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-468?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12492386 ] Andrzej Bialecki commented on NUTCH-468: - +1. I'm writing a scoring plugin now where it's impossible

Re: [Nutch-dev] Fetcher2's delay between successive requests

2007-04-24 Thread Andrzej Bialecki
something here? See the ASCII-art graphs and comments in NUTCH-385 - this is likely not what is expected. Although this JIRA issue is still open, the Fetcher2 code tries to implement this middle ground solution. -- Best regards, Andrzej Bialecki

Re: [Nutch-dev] Fetcher2's delay between successive requests

2007-04-24 Thread Andrzej Bialecki
} (notice the protocol/http difference) to false to indicate lib-http shouldn't handle blocking internally. Because of this, when you use Fetcher2, lib-http still tries to block them which makes Fetcher2 much less useful. This is definitely a bug. -- Best regards, Andrzej Bialecki

[Nutch-dev] [jira] Commented: (NUTCH-471) Fix synchronization in NutchBean creation

2007-04-24 Thread Andrzej Bialecki (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-471?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12491290 ] Andrzej Bialecki commented on NUTCH-471: - +1. Nice trick with the unsynchronized check. :) Fix

Re: [Nutch-dev] Fetcher2's delay between successive requests

2007-04-24 Thread Andrzej Bialecki
, you're right - it's a bug. However, the reasoning that I presented still holds, it's just the implementation that doesn't get it ;) -- Best regards, Andrzej Bialecki

[Nutch-dev] [jira] Closed: (NUTCH-474) Fetcher2 sets server-delay and blocking checks incorrectly

2007-04-24 Thread Andrzej Bialecki (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-474?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrzej Bialecki closed NUTCH-474. --- Resolution: Fixed Assignee: Andrzej Bialecki Fixed in rev. 532088. Thanks! Fetcher2

Re: [Nutch-dev] Have anybody thought of replacing CrawlDb with any kind of Rational DB?

2007-04-18 Thread Andrzej Bialecki
wangxu wrote: Andrzej Bialecki wrote: Howie Wang wrote: I definitely don't expect people to write it just because it happens to be useful to me :-) Call me crazy, but I'm thinking of implementing this when I get some free time (whenever that will be). It seems that I would just need

Re: [Nutch-dev] Have anybody thought of replacing CrawlDb with any kind of Rational DB?

2007-04-13 Thread Andrzej Bialecki
be a good option to have, especially for smaller setups - but it would require extensive modifications to many tools in Nutch. Unless you are willing to provide patches that implement it without breaking the large-scale case, I think we should let the matter rest ... -- Best regards, Andrzej

Re: [Nutch-dev] Have anybody thought of replacing CrawlDb with any kind of Rational DB?

2007-04-13 Thread Andrzej Bialecki
in the DB is used not standalone, but as one of many inputs to a map-reduce job. To summarize - I think it would be very difficult to do this with the current codebase. -- Best regards, Andrzej Bialecki

Re: [Nutch-dev] Have anybody thought of replacing CrawlDb with any kind of Rational DB?

2007-04-12 Thread Andrzej Bialecki
this issue. :) -- Best regards, Andrzej Bialecki

Re: [Nutch-dev] [VOTE] Release Apache Nutch 0.9

2007-04-04 Thread Andrzej Bialecki
think we should move forward. -- Best regards, Andrzej Bialecki

Re: [Nutch-dev] [VOTE] Release Apache Nutch 0.9

2007-04-04 Thread Andrzej Bialecki
Chris Mattmann wrote: [..] [ ] +1 Release the packages as Apache Nutch 0.9 [ ] -1 Do not release the packages because... +1. -- Best regards, Andrzej Bialecki

Re: [Nutch-dev] [VOTE] Release Apache Nutch 0.9

2007-04-04 Thread Andrzej Bialecki
and discovering new issues, and patching them, we will never make a release ... I think for issues that are not critical or blocker we should press forward, otherwise we will have to wait another 72 hours, and another, and another ... -- Best regards, Andrzej Bialecki

[Nutch-dev] [jira] Commented: (NUTCH-466) Flexible segment format

2007-04-02 Thread Andrzej Bialecki (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-466?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12485986 ] Andrzej Bialecki commented on NUTCH-466: - Minor nit: MapFile requires that the key is a WritableComparable

[Nutch-dev] [jira] Commented: (NUTCH-466) Flexible segment format

2007-04-02 Thread Andrzej Bialecki (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-466?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12486003 ] Andrzej Bialecki commented on NUTCH-466: - I thought that the map will be from class names to directory

[Nutch-dev] [jira] Created: (NUTCH-466) Flexible segment format

2007-04-01 Thread Andrzej Bialecki (JIRA)
: Andrzej Bialecki Assigned To: Andrzej Bialecki In many situations it is necessary to store more data associated with pages than is possible now with the current segment format. Quite often it's binary data. There are two common workarounds for this: one is to use per-page metadata

Re: [Nutch-dev] [VOTE] Release Apache Nutch 0.9

2007-03-30 Thread Andrzej Bialecki
more (in a separate thread) before rewriting the how to release page in wiki. I agree - the current release process didn't fare too well in this particular situation ... -- Best regards, Andrzej Bialecki

Re: [Nutch-dev] [VOTE] Release Apache Nutch 0.9

2007-03-29 Thread Andrzej Bialecki
Sami Siren wrote: 2007/3/29, Andrzej Bialecki [EMAIL PROTECTED]: Sami Siren wrote: IMO we should have had a 0.9-rc1 tag, apply patch to trunk, have 0.9-rc2 tag and so on until we are satisfied. Then when we're actually satisfied create tag for 0.9 (copy from rc that got promoted

Re: [Nutch-dev] [VOTE] Release Apache Nutch 0.9

2007-03-29 Thread Andrzej Bialecki
Sami Siren wrote: 2007/3/29, Andrzej Bialecki [EMAIL PROTECTED]: Sami Siren wrote: 2007/3/29, Andrzej Bialecki [EMAIL PROTECTED]: Sami Siren wrote: IMO we should have had a 0.9-rc1 tag, apply patch to trunk, have 0.9-rc2 tag and so on until we are satisfied. Then when we're

Re: [Nutch-dev] [VOTE] Release Apache Nutch 0.9

2007-03-28 Thread Andrzej Bialecki
Hadoop installation, i.e. one where Hadoop daemons are started without Nutch classes on the classpath. -- Best regards, Andrzej Bialecki

[Nutch-dev] [jira] Closed: (NUTCH-246) segment size is never as big as topN or crawlDB size in a distributed deployement

2007-03-22 Thread Andrzej Bialecki (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-246?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrzej Bialecki closed NUTCH-246. --- Resolution: Fixed Assignee: Andrzej Bialecki Thanks for reminding us about

Re: [Nutch-dev] Issues pending before 0.9 release

2007-03-21 Thread Andrzej Bialecki
Sami Siren wrote: for me it works: ... BUILD SUCCESSFUL Total time: 4 minutes 3 seconds I did a fresh checkout to an empty dir, rebuilt and it's still failing - perhaps you have some uncommitted changes in your working copy ... ? -- Best regards, Andrzej Bialecki

Re: [Nutch-dev] Issues pending before 0.9 release

2007-03-21 Thread Andrzej Bialecki
rubdabadub wrote: Hi: Just wondering about NUTCH-61 http://issues.apache.org/jira/browse/Nutch-61 Will it make the 0.9 cut? It would be nice if it did. It's probably too late. This was discussed before - it will be applied right after the release. -- Best regards, Andrzej Bialecki

Re: [Nutch-dev] Issues pending before 0.9 release

2007-03-21 Thread Andrzej Bialecki
is the reason - it seems that the results of text extraction are completely different under 1.6 ... -- Best regards, Andrzej Bialecki

Re: [Nutch-dev] Issues pending before 0.9 release

2007-03-20 Thread Andrzej Bialecki
Sami Siren wrote: Andrzej Bialecki wrote: Hi all, I just committed Hadoop 0.12.1. Let's double-check that it works ok. Here's the list of Critical/Blocker issues I mentioned before, and their current status: Any other stuff we need to fix before the release? I am satisfied except

Re: [Nutch-dev] Launching custom classes

2007-03-19 Thread Andrzej Bialecki
is in the classpath. I think that What needs to be on your classpath is the *.job jar. The bin/nutch script takes care of that if you built your Nutch using the command-line version of ant. -- Best regards, Andrzej Bialecki

[Nutch-dev] [jira] Commented: (NUTCH-381) Ignore external link not work as expected

2007-03-19 Thread Andrzej Bialecki (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-381?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12482266 ] Andrzej Bialecki commented on NUTCH-381: - Your last comment confirms my suspicions. After analysis

[Nutch-dev] [jira] Closed: (NUTCH-381) Ignore external link not work as expected

2007-03-19 Thread Andrzej Bialecki (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-381?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrzej Bialecki closed NUTCH-381. --- Resolution: Won't Fix Fix Version/s: 0.9.0 Assignee: Andrzej Bialecki

[Nutch-dev] [jira] Closed: (NUTCH-277) Fetcher dies because of max. redirects (avoiding infinite loop)

2007-03-19 Thread Andrzej Bialecki (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-277?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrzej Bialecki closed NUTCH-277. --- Resolution: Cannot Reproduce Fix Version/s: 0.9.0 Assignee: Andrzej Bialecki

[Nutch-dev] [jira] Closed: (NUTCH-459) Upgrade Nutch to Hadoop 0.12.1

2007-03-19 Thread Andrzej Bialecki (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-459?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrzej Bialecki closed NUTCH-459. --- Resolution: Fixed Upgraded to 0.12.1 release. Upgrade Nutch to Hadoop 0.12.1

[Nutch-dev] [jira] Updated: (NUTCH-353) pages that serverside forwards will be refetched every time

2007-03-19 Thread Andrzej Bialecki (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-353?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrzej Bialecki updated NUTCH-353: Priority: Major (was: Blocker) This is partially fixed so that page status is consistent

[Nutch-dev] [jira] Closed: (NUTCH-450) How to set up nutch

2007-03-19 Thread Andrzej Bialecki (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-450?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrzej Bialecki closed NUTCH-450. --- Resolution: Invalid Assignee: Andrzej Bialecki This belongs in nutch-user mailing list

[Nutch-dev] [jira] Updated: (NUTCH-451) Tool to recover partial fetcher output

2007-03-19 Thread Andrzej Bialecki (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-451?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrzej Bialecki updated NUTCH-451: Priority: Minor (was: Major) Tool to recover partial fetcher output

Re: [Nutch-dev] Issues pending before 0.9 release

2007-03-19 Thread Andrzej Bialecki
-427 Moved to Major, fix after release. NUTCH-381 Won't fix - this is a configuration issue. NUTCH-277 Cannot reproduce NUTCH-167 Fixed. Any other stuff we need to fix before the release? -- Best regards, Andrzej Bialecki

Re: [Nutch-dev] Hadoop 0.11.2 vs. 0.12.1

2007-03-15 Thread Andrzej Bialecki
, in that case creating the JIRA issue might seem pointless, indeed ... but it's there now, so iff we discover any changes that need to be made we can attach them to this issue. -- Best regards, Andrzej Bialecki

Re: [Nutch-dev] New Jira Hudson plugin

2007-03-15 Thread Andrzej Bialecki
=http://lucene.zones.apache.org:8080/hudson/job/Hadoop-Nightly/24/) -- Best regards, Andrzej Bialecki, Information Retrieval, Semantic Web, Embedded Unix, System Integration http

Re: [Nutch-dev] DummySSLProtocolSocketFactory problem, please help me!!!!

2007-03-14 Thread Andrzej Bialecki
any answers. What helps is when you create a bug issue in JIRA, describe the problem and attach a patch that helped in your case. Thank you for your co-operation. ;) -- Best regards, Andrzej Bialecki

Re: [Nutch-dev] Hadoop 0.11.2 vs. 0.12.1

2007-03-14 Thread Andrzej Bialecki
release. -- Best regards, Andrzej Bialecki, Information Retrieval, Semantic Web, Embedded Unix, System Integration http://www.sigram.com Contact: info at sigram dot com
