[jira] [Updated] (NUTCH-2467) Sitemap type field can be null
[ https://issues.apache.org/jira/browse/NUTCH-2467?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Markus Jelsma updated NUTCH-2467:
---------------------------------
    Attachment: NUTCH-2467.patch

Incredibly stupid patch, but I did it because the sitemap.type field being null is probably a bug, and this patch should probably be reverted once that is fixed. Any CC comments on this?

> Sitemap type field can be null
> ------------------------------
>
>              Key: NUTCH-2467
>              URL: https://issues.apache.org/jira/browse/NUTCH-2467
>          Project: Nutch
>       Issue Type: Bug
> Affects Versions: 1.13
>         Reporter: Markus Jelsma
>         Assignee: Markus Jelsma
>          Fix For: 1.14
>
>      Attachments: NUTCH-2467.patch
>
> sitemap.isIndex() can return null for real sitemap indices, so their contents
> won't be added to the CrawlDB. For example, the indices that
> https://www.reisenco.nl/sitemap_index.xml points to are not processed.

--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
[jira] [Updated] (NUTCH-2467) Sitemap type field can be null
[ https://issues.apache.org/jira/browse/NUTCH-2467?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Markus Jelsma updated NUTCH-2467:
---------------------------------
    Patch Info: Patch Available

> Sitemap type field can be null
> ------------------------------
>
>              Key: NUTCH-2467
>              URL: https://issues.apache.org/jira/browse/NUTCH-2467
>          Project: Nutch
>       Issue Type: Bug
> Affects Versions: 1.13
>         Reporter: Markus Jelsma
>         Assignee: Markus Jelsma
>          Fix For: 1.14
>
> sitemap.isIndex() can return null for real sitemap indices, so their contents
> won't be added to the CrawlDB. For example, the indices that
> https://www.reisenco.nl/sitemap_index.xml points to are not processed.
[jira] [Created] (NUTCH-2467) Sitemap type field can be null
Markus Jelsma created NUTCH-2467:
------------------------------------

             Summary: Sitemap type field can be null
                 Key: NUTCH-2467
                 URL: https://issues.apache.org/jira/browse/NUTCH-2467
             Project: Nutch
          Issue Type: Bug
    Affects Versions: 1.13
            Reporter: Markus Jelsma
            Assignee: Markus Jelsma
             Fix For: 1.14


sitemap.isIndex() can return null for real sitemap indices, so their contents
won't be added to the CrawlDB. For example, the indices that
https://www.reisenco.nl/sitemap_index.xml points to are not processed.
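[Editor's note: the null-unboxing pitfall behind this report can be sketched as follows. This is an illustrative sketch only, not Nutch's or crawler-commons' actual code; the `SitemapRecord`, `naiveIsIndex`, and `lenientIsIndex` names are hypothetical.]

```java
// Illustrative sketch of the problem: when a sitemap's type field was never
// set, a boxed Boolean isIndex flag is null, and a naive check silently
// treats a real sitemap index as a plain sitemap, so its child sitemaps
// never reach the CrawlDB.
class SitemapTypeCheck {

  static class SitemapRecord {
    Boolean isIndex; // boxed: may be null when the type was never detected

    SitemapRecord(Boolean isIndex) {
      this.isIndex = isIndex;
    }
  }

  // Naive check: null collapses to "not an index", which is the failure
  // mode described in the issue.
  static boolean naiveIsIndex(SitemapRecord s) {
    return Boolean.TRUE.equals(s.isIndex);
  }

  // Workaround in the spirit of the attached patch: treat an unknown (null)
  // type as a potential index so its entries are still processed.
  static boolean lenientIsIndex(SitemapRecord s) {
    return s.isIndex == null || s.isIndex.booleanValue();
  }
}
```

With the lenient check, a record whose type field is null is still walked as an index, which matches the behavior the patch is trying to restore until the underlying null is fixed.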
[jira] [Updated] (NUTCH-2466) Sitemap processor to follow redirects
[ https://issues.apache.org/jira/browse/NUTCH-2466?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Markus Jelsma updated NUTCH-2466:
---------------------------------
    Patch Info: Patch Available

> Sitemap processor to follow redirects
> -------------------------------------
>
>              Key: NUTCH-2466
>              URL: https://issues.apache.org/jira/browse/NUTCH-2466
>          Project: Nutch
>       Issue Type: Bug
> Affects Versions: 1.13
>         Reporter: Markus Jelsma
>         Assignee: Markus Jelsma
>         Priority: Minor
>          Fix For: 1.14
>
>      Attachments: NUTCH-2466.patch
>
> It does follow http > https redirects, but not a subsequent redirect, e.g.
> for the sitemap_index.xml that some websites have.
[jira] [Updated] (NUTCH-2466) Sitemap processor to follow redirects
[ https://issues.apache.org/jira/browse/NUTCH-2466?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Markus Jelsma updated NUTCH-2466:
---------------------------------
    Attachment: NUTCH-2466.patch

Patch for master!

> Sitemap processor to follow redirects
> -------------------------------------
>
>              Key: NUTCH-2466
>              URL: https://issues.apache.org/jira/browse/NUTCH-2466
>          Project: Nutch
>       Issue Type: Bug
> Affects Versions: 1.13
>         Reporter: Markus Jelsma
>         Assignee: Markus Jelsma
>         Priority: Minor
>          Fix For: 1.14
>
>      Attachments: NUTCH-2466.patch
>
> It does follow http > https redirects, but not a subsequent redirect, e.g.
> for the sitemap_index.xml that some websites have.
[jira] [Created] (NUTCH-2466) Sitemap processor to follow redirects
Markus Jelsma created NUTCH-2466:
------------------------------------

             Summary: Sitemap processor to follow redirects
                 Key: NUTCH-2466
                 URL: https://issues.apache.org/jira/browse/NUTCH-2466
             Project: Nutch
          Issue Type: Bug
    Affects Versions: 1.13
            Reporter: Markus Jelsma
            Assignee: Markus Jelsma
            Priority: Minor
             Fix For: 1.14


It does follow http > https redirects, but not a subsequent redirect, e.g.
for the sitemap_index.xml that some websites have.
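[Editor's note: following a redirect chain, rather than only the first hop, comes down to resolving each Location header against the current URL and looping with a hop limit. The helpers below are an illustrative sketch, not Nutch's actual SitemapProcessor code; `RedirectFollower` and its method names are hypothetical.]

```java
import java.net.MalformedURLException;
import java.net.URL;

// Illustrative helpers for following a redirect chain when fetching a
// sitemap. A real fetch loop would call isRedirect() on each response
// status and resolveLocation() on each Location header, capped at a
// small hop limit (e.g. 5) to avoid redirect cycles.
class RedirectFollower {

  // The HTTP redirect status codes: 301, 302, 303, 307, 308.
  static boolean isRedirect(int status) {
    return status == 301 || status == 302 || status == 303
        || status == 307 || status == 308;
  }

  // Resolve a (possibly relative) Location header against the current URL,
  // so both absolute (http > https) and relative redirects are handled.
  static String resolveLocation(String current, String location)
      throws MalformedURLException {
    return new URL(new URL(current), location).toString();
  }
}
```

The point of resolving rather than blindly taking the Location value is that the second redirect in a chain (the one the issue says is not followed) may be relative to the URL reached after the first hop.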
[jira] [Commented] (NUTCH-2463) Enable sampling CrawlDB
[ https://issues.apache.org/jira/browse/NUTCH-2463?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16268625#comment-16268625 ]

Hudson commented on NUTCH-2463:
-------------------------------

SUCCESS: Integrated in Jenkins build Nutch-trunk #3469 (See [https://builds.apache.org/job/Nutch-trunk/3469/])
NUTCH-2463 - Enable sampling CrawlDB (github: [https://github.com/apache/nutch/commit/65651b5cce54736978356ba1a8dea8a10f405d3c])
* (edit) src/java/org/apache/nutch/crawl/CrawlDbReader.java

> Enable sampling CrawlDB
> -----------------------
>
>              Key: NUTCH-2463
>              URL: https://issues.apache.org/jira/browse/NUTCH-2463
>          Project: Nutch
>       Issue Type: Improvement
>       Components: crawldb
>         Reporter: Yossi Tamari
>         Priority: Minor
>          Fix For: 1.14
>
> CrawlDB can grow to contain billions of records. When that happens *readdb
> -dump* is pretty useless, and *readdb -topN* can run for ages (and does not
> provide a statistically correct sample).
> We should add a parameter *-sample* to *readdb -dump* which is followed by a
> number between 0 and 1, and only that fraction of records from the CrawlDB
> will be processed.
> The sample should be statistically random, and all the other filters should
> be applied on the sampled records.
[jira] [Commented] (NUTCH-2458) TikaParser doesn't work with tika-config.xml set
[ https://issues.apache.org/jira/browse/NUTCH-2458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16268626#comment-16268626 ]

Hudson commented on NUTCH-2458:
-------------------------------

SUCCESS: Integrated in Jenkins build Nutch-trunk #3469 (See [https://builds.apache.org/job/Nutch-trunk/3469/])
NUTCH-2458 (snagel: [https://github.com/apache/nutch/commit/c17dd1dd6bf914beb7b13528c95b487630f86905])
* (edit) src/plugin/parse-tika/src/java/org/apache/nutch/parse/tika/TikaParser.java

> TikaParser doesn't work with tika-config.xml set
> ------------------------------------------------
>
>              Key: NUTCH-2458
>              URL: https://issues.apache.org/jira/browse/NUTCH-2458
>          Project: Nutch
>       Issue Type: Bug
>       Components: parser
> Affects Versions: 1.13
>         Reporter: Markus Jelsma
>         Assignee: Markus Jelsma
>          Fix For: 1.14
>
>      Attachments: NUTCH-2458.patch
>
> Well, it doesn't indeed. Thanks to Timothy Allison, it's solved.
[jira] [Commented] (NUTCH-2463) Enable sampling CrawlDB
[ https://issues.apache.org/jira/browse/NUTCH-2463?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16268573#comment-16268573 ]

ASF GitHub Bot commented on NUTCH-2463:
---------------------------------------

YossiTamari commented on issue #243: NUTCH-2463 - Enable sampling CrawlDB
URL: https://github.com/apache/nutch/pull/243#issuecomment-347489658

Thanks!

> Enable sampling CrawlDB
> -----------------------
>
>              Key: NUTCH-2463
>              URL: https://issues.apache.org/jira/browse/NUTCH-2463
>          Project: Nutch
>       Issue Type: Improvement
>       Components: crawldb
>         Reporter: Yossi Tamari
>         Priority: Minor
>          Fix For: 1.14
>
> CrawlDB can grow to contain billions of records. When that happens *readdb
> -dump* is pretty useless, and *readdb -topN* can run for ages (and does not
> provide a statistically correct sample).
> We should add a parameter *-sample* to *readdb -dump* which is followed by a
> number between 0 and 1, and only that fraction of records from the CrawlDB
> will be processed.
> The sample should be statistically random, and all the other filters should
> be applied on the sampled records.
[jira] [Resolved] (NUTCH-2463) Enable sampling CrawlDB
[ https://issues.apache.org/jira/browse/NUTCH-2463?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sebastian Nagel resolved NUTCH-2463.
------------------------------------
    Resolution: Implemented

Thanks, [~yossi]!

> Enable sampling CrawlDB
> -----------------------
>
>              Key: NUTCH-2463
>              URL: https://issues.apache.org/jira/browse/NUTCH-2463
>          Project: Nutch
>       Issue Type: Improvement
>       Components: crawldb
>         Reporter: Yossi Tamari
>         Priority: Minor
>          Fix For: 1.14
>
> CrawlDB can grow to contain billions of records. When that happens *readdb
> -dump* is pretty useless, and *readdb -topN* can run for ages (and does not
> provide a statistically correct sample).
> We should add a parameter *-sample* to *readdb -dump* which is followed by a
> number between 0 and 1, and only that fraction of records from the CrawlDB
> will be processed.
> The sample should be statistically random, and all the other filters should
> be applied on the sampled records.
[jira] [Commented] (NUTCH-2463) Enable sampling CrawlDB
[ https://issues.apache.org/jira/browse/NUTCH-2463?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16268548#comment-16268548 ]

ASF GitHub Bot commented on NUTCH-2463:
---------------------------------------

sebastian-nagel closed pull request #243: NUTCH-2463 - Enable sampling CrawlDB
URL: https://github.com/apache/nutch/pull/243

This is a PR merged from a forked repository. As GitHub hides the original diff on merge, it is displayed below for the sake of provenance:

diff --git a/src/java/org/apache/nutch/crawl/CrawlDbReader.java b/src/java/org/apache/nutch/crawl/CrawlDbReader.java
index bfb016428..e245e380c 100644
--- a/src/java/org/apache/nutch/crawl/CrawlDbReader.java
+++ b/src/java/org/apache/nutch/crawl/CrawlDbReader.java
@@ -511,7 +511,7 @@ public void readUrl(String crawlDb, String url, JobConf config)
 
   public void processDumpJob(String crawlDb, String output, JobConf config,
       String format, String regex, String status,
-      Integer retry, String expr) throws IOException {
+      Integer retry, String expr, Float sample) throws IOException {
     if (LOG.isInfoEnabled()) {
       LOG.info("CrawlDb dump: starting");
       LOG.info("CrawlDb db: " + crawlDb);
@@ -544,6 +544,8 @@ public void processDumpJob(String crawlDb, String output,
       job.set("expr", expr);
       LOG.info("CrawlDb db: expr: " + expr);
     }
+    if (sample != null)
+      job.setFloat("sample", sample);
 
     job.setMapperClass(CrawlDbDumpMapper.class);
     job.setOutputKeyClass(Text.class);
@@ -562,6 +564,7 @@ public void processDumpJob(String crawlDb, String output,
     String status = null;
     Integer retry = null;
     Expression expr = null;
+    float sample;
 
     public void configure(JobConf job) {
       if (job.get("regex", null) != null) {
@@ -573,6 +576,7 @@ public void configure(JobConf job) {
      if (job.get("expr", null) != null) {
        expr = JexlUtil.parseExpression(job.get("expr", null));
      }
+      sample = job.getFloat("sample", 1);
    }
 
    public void close() {
@@ -582,6 +586,10 @@
    public void map(Text key, CrawlDatum value,
        OutputCollector<Text, CrawlDatum> output, Reporter reporter)
        throws IOException {
+      // check sample
+      if (sample < 1 && Math.random() > sample) {
+        return;
+      }
      // check retry
      if (retry != -1) {
        if (value.getRetriesSinceFetch() < retry) {
@@ -693,6 +701,7 @@ public int run(String[] args) throws IOException {
      System.err
          .println("\t\t[-status ]\tfilter records by CrawlDatum status");
      System.err.println("\t\t[-expr ]\tJexl expression to evaluate for this record");
+      System.err.println("\t\t[-sample ]\tOnly process a random sample with this ratio");
      System.err
          .println("\t-url \tprint information on to System.out");
      System.err
@@ -720,6 +729,7 @@ public int run(String[] args) throws IOException {
        Integer retry = null;
        String status = null;
        String expr = null;
+        Float sample = null;
        for (int j = i + 1; j < args.length; j++) {
          if (args[j].equals("-format")) {
            format = args[++j];
@@ -741,8 +751,12 @@ public int run(String[] args) throws IOException {
            expr = args[++j];
            i = i + 2;
          }
+          if (args[j].equals("-sample")) {
+            sample = Float.parseFloat(args[++j]);
+            i = i + 2;
+          }
        }
-        dbr.processDumpJob(crawlDb, param, job, format, regex, status, retry, expr);
+        dbr.processDumpJob(crawlDb, param, job, format, regex, status, retry, expr, sample);
      } else if (args[i].equals("-url")) {
        param = args[++i];
        dbr.readUrl(crawlDb, param, job);
@@ -833,6 +847,7 @@ public Object query(Map args, Configuration conf, String type, S
    Integer retry = null;
    String status = null;
    String expr = null;
+    Float sample = null;
    if (args.containsKey("format")) {
      format = args.get("format");
    }
@@ -848,7 +863,10 @@ public Object query(Map args, Configuration conf, String type, S
    if (args.containsKey("expr")) {
      expr = args.get("expr");
    }
-    processDumpJob(crawlDb, output, new NutchJob(conf), format, regex, status, retry, expr);
+    if (args.containsKey("sample")) {
+      sample = Float.parseFloat(args.get("sample"));
+    }
+    processDumpJob(crawlDb, output, new NutchJob(conf), format, regex, status, retry, expr, sample);
    File dumpFile = new File(output+"/part-0");
    return dumpFile;
  }
@@ -886,4 +904,4 @@ public Object query(Map args, Configuration conf, String type, S
    }
    return results;
  }
-}
\ No newline at end of file
+}
[jira] [Updated] (NUTCH-2463) Enable sampling CrawlDB
[ https://issues.apache.org/jira/browse/NUTCH-2463?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sebastian Nagel updated NUTCH-2463:
-----------------------------------
    Fix Version/s: 1.14

> Enable sampling CrawlDB
> -----------------------
>
>              Key: NUTCH-2463
>              URL: https://issues.apache.org/jira/browse/NUTCH-2463
>          Project: Nutch
>       Issue Type: Improvement
>       Components: crawldb
>         Reporter: Yossi Tamari
>         Priority: Minor
>          Fix For: 1.14
>
> CrawlDB can grow to contain billions of records. When that happens *readdb
> -dump* is pretty useless, and *readdb -topN* can run for ages (and does not
> provide a statistically correct sample).
> We should add a parameter *-sample* to *readdb -dump* which is followed by a
> number between 0 and 1, and only that fraction of records from the CrawlDB
> will be processed.
> The sample should be statistically random, and all the other filters should
> be applied on the sampled records.
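[Editor's note: the per-record Bernoulli sampling described in this issue (keep each record independently with probability `sample`, then apply the other filters) can be sketched standalone as below. This is an illustrative sketch, not the CrawlDbReader mapper itself; a seeded Random stands in for Math.random() so the behavior is reproducible.]

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

// Standalone sketch of the sampling idea behind `readdb -dump -sample`:
// each record is kept independently with probability `sample`, which yields
// a statistically random sample in a single pass, unlike -topN.
class CrawlDbSampler {

  static List<String> sample(List<String> records, float sample, Random rnd) {
    List<String> kept = new ArrayList<>();
    for (String record : records) {
      // same shape of test as the mapper in the merged patch:
      // skip the record when the random draw exceeds the ratio
      if (sample < 1 && rnd.nextDouble() > sample) {
        continue;
      }
      kept.add(record);
    }
    return kept;
  }
}
```

The expected output size is `sample * numRecords`, and because each record is decided independently, downstream filters (status, regex, Jexl expression) compose with it without biasing the sample.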