Re: [DISCUSS] Release Trunk

2014-02-14 Thread Sebastian Nagel
Hi,

-1 also from me for now.

Besides SegmentMerger (NUTCH-1113), there is a problem in the indexer
(NUTCH-1706/NUTCH-1646) that should be fixed. I hope to tackle both issues soon.

Sebastian


On 02/13/2014 10:19 AM, Markus Jelsma wrote:
> Seems some of my mails to the list are not coming through. I am -1 on release 
> from trunk as is. The segment merger is still broken and in my opinion we 
> cannot push yet another release with a broken segment merger.
> 
> Markus
> 
> -Original message-
> From: Tejas Patil
> Sent: Thursday 13th February 2014 1:33
> To: dev@nutch.apache.org
> Subject: Re: [DISCUSS] Release Trunk
> 
> Just saw the commits since 1.7 release. Apart from trivial bug fixes, we have 
> some significant patches since 1.7.
> 
> +1 for new release. I would be happy to volunteer / help.
> 
> Thanks,
> 
> Tejas
> 
> On Wed, Feb 12, 2014 at 7:33 AM, Julien Nioche wrote:
> 
> Hi guys,
> 
> At least 2 of the issues that Seb and I had mentioned have now been 
> committed. What about releasing 1.8 from trunk? If so, any volunteers?
> 
> Julien
> 
> On 2 December 2013 21:02, Sebastian Nagel wrote:
> 
> Hi,
> 
> +1 to release soon (this year, or early next year)
> 
>> and probably a few others but they could also be done later.
> 
> At least, these should be done before releasing:
> 
> NUTCH-1646 IndexerMapReduce to consider DB status
> 
> NUTCH-1413 Record response time
> 
> Sebastian
> 
> On 11/28/2013 05:49 PM, Julien Nioche wrote:
>> Hi Lewis
>>
>> We've done quite a few things in 1.x since the previous release (e.g. generic
>> deduplication, removing indexer.solr package, etc.) and the next 2.x release
>> will be after the changes to GORA have been made, tested and used on the
>> Nutch side, so that could be quite a while.
>>
>> I am neutral as to whether we should do a 1.x release now. There are some
>> minor issues that we could do in 1.x before the next release like:
>> * https://issues.apache.org/jira/browse/NUTCH-1360
>> * https://issues.apache.org/jira/browse/NUTCH-1676
>> and probably a few others but they could also be done later.
>>
>> Let's hear what others think.
>>
>> Thanks
>>
>> Julien
>>
>> On 28 November 2013 16:34, Lewis John Mcgibbney wrote:
>>
>> Hi Folks,
>> Thread says it all.
>> There are some hot tickets over in Gora right now so I think holding off
>> the next while for a 2.x release would be wise.
>> I can spin the RC for trunk tonight/tomorrow/weekend if we get the
>> thumbs up.
>> Ta
>> Lewis
>>
>> --
>> /Lewis/
>>
>> --
>> Open Source Solutions for Text Engineering
>>
>> http://digitalpebble.blogspot.com/
>> http://www.digitalpebble.com
>> http://twitter.com/digitalpebble
> 



[jira] [Commented] (NUTCH-1726) HeadingsFilter does not find nested nodes

2014-02-14 Thread Markus Jelsma (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1726?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13901363#comment-13901363
 ] 

Markus Jelsma commented on NUTCH-1726:
--

Hi lufeng!

I don't understand. I have a clean Apache Nutch headings plugin, and the same
test fails for both my patch and your patch.

{code}
Testcase: testIt took 1.489 sec
Testcase: testMultiValueMetatags took 0.185 sec
FAILED
One value of metatag with multiple values is missing: Test header h2 with span
junit.framework.AssertionFailedError: One value of metatag with multiple values is missing: Test header h2 with span
at org.apache.nutch.parse.headings.TestHeadingsParseFilter.testMultiValueMetatags(TestHeadingsParseFilter.java:97)

{code}

I added truncate because some users may want to ignore long headers instead of
truncating them. If I get a header containing 2 kB of text, I think I would
like to skip it, not truncate it.

Markus

> HeadingsFilter does not find nested nodes
> -
>
> Key: NUTCH-1726
> URL: https://issues.apache.org/jira/browse/NUTCH-1726
> Project: Nutch
> Issue Type: Bug
> Affects Versions: 1.7
> Reporter: Markus Jelsma
> Assignee: Markus Jelsma
> Priority: Minor
> Fix For: 1.8
>
> Attachments: NUTCH-1726-trunk-v2.patch, NUTCH-1726-trunk.patch, 
> NUTCH-1726-trunk.patch
>
>
> Filter won't find:
> {code}
> apache nutch
> {code}
> The getNodeValue() tries to read data from children but should traverse nodes 
> instead.
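
The traversal described above can be sketched with the JDK's own org.w3c.dom API. This is only an illustration, not the code from the attached patches: the class name NestedHeadingText, its methods, and the sample markup are assumptions, and the input is parsed as well-formed XML rather than through Nutch's HTML parser. It shows why reading only a heading's direct children misses text wrapped in a nested element, while a recursive walk finds it.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper, not part of Nutch.
final class NestedHeadingText {

  // Appends the text of this node and of all its descendants, in document order.
  static void collectText(Node node, StringBuilder out) {
    if (node.getNodeType() == Node.TEXT_NODE) {
      out.append(node.getNodeValue());
      return;
    }
    NodeList children = node.getChildNodes();
    for (int i = 0; i < children.getLength(); i++) {
      collectText(children.item(i), out);
    }
  }

  // Returns the trimmed text of the first element named `tag`, or "" if none exists.
  static String headingText(String xml, String tag) {
    try {
      DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
      Document doc = builder.parse(new InputSource(new StringReader(xml)));
      NodeList headings = doc.getElementsByTagName(tag);
      if (headings.getLength() == 0) {
        return "";
      }
      StringBuilder out = new StringBuilder();
      collectText(headings.item(0), out);
      return out.toString().trim();
    } catch (Exception e) {
      throw new IllegalArgumentException("could not parse markup", e);
    }
  }

  public static void main(String[] args) {
    // Assumed sample markup: the heading text sits inside a nested link element.
    String markup = "<h1><a href=\"#\">apache nutch</a></h1>";
    System.out.println(headingText(markup, "h1"));
  }
}
```

The recursion stops at text nodes, so any depth of nesting (links, spans, emphasis) contributes its text to the heading value.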



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)


[jira] [Commented] (NUTCH-1525) Generator to record external links even when db.ignore.external.links set to true

2014-02-14 Thread Lewis John McGibbney (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1525?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13901313#comment-13901313
 ] 

Lewis John McGibbney commented on NUTCH-1525:
-

[~sabio], thank you for the patch. I totally forgot about this issue. 
Can we verify if we are able to derive Hadoop counters as well as/instead of 
simple logging?
If we can obtain counters then it is much easier to analyze the number of 
external links we filter.
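
To make the counter idea concrete, here is a rough, self-contained sketch in plain Java. It is not the attached patch and does not use Hadoop: the class ExternalLinkCounter, the group name "ParserStatus" and the counter name "external_links_ignored" are all made up for illustration, and the map stands in for a Hadoop counter (for example the old-API Reporter.incrCounter(group, counter, amount)). A link is treated as external when its host differs from the host of the page it was found on.

```java
import java.net.URI;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical stand-in for job counters; names are illustrative only.
final class ExternalLinkCounter {
  static final String GROUP = "ParserStatus";             // assumed group name
  static final String IGNORED = "external_links_ignored"; // assumed counter name

  private final Map<String, Long> counters = new TreeMap<>();

  // Adds one to the counter identified by group and name.
  void increment(String group, String name) {
    counters.merge(group + "." + name, 1L, Long::sum);
  }

  // Returns the current value of a counter, or 0 if it was never incremented.
  long get(String group, String name) {
    return counters.getOrDefault(group + "." + name, 0L);
  }

  // A link is external when either host is missing or the two hosts differ.
  static boolean isExternal(String fromUrl, String toUrl) {
    String fromHost = URI.create(fromUrl).getHost();
    String toHost = URI.create(toUrl).getHost();
    return fromHost == null || toHost == null || !fromHost.equalsIgnoreCase(toHost);
  }

  // Counts the link if it is external; internal links are left alone.
  void record(String fromUrl, String toUrl) {
    if (isExternal(fromUrl, toUrl)) {
      increment(GROUP, IGNORED);
    }
  }

  public static void main(String[] args) {
    ExternalLinkCounter c = new ExternalLinkCounter();
    String page = "http://nutch.apache.org/";
    c.record(page, "http://nutch.apache.org/downloads.html");
    c.record(page, "http://www.digitalpebble.com");
    c.record(page, "https://issues.apache.org/jira/browse/NUTCH-1525");
    System.out.println(GROUP + "." + IGNORED + " = " + c.get(GROUP, IGNORED));
  }
}
```

A counter only gives the number of ignored links, so it would complement the log file rather than replace it if the URLs themselves are needed later. Note that URI.create throws IllegalArgumentException on malformed URLs, which real crawl data would need to handle.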

> Generator to record external links even when  db.ignore.external.links set to 
> true
> --
>
> Key: NUTCH-1525
> URL: https://issues.apache.org/jira/browse/NUTCH-1525
> Project: Nutch
> Issue Type: Improvement
> Components: generator
> Reporter: Lewis John McGibbney
> Priority: Minor
> Fix For: 2.4
>
> Attachments: nutch-logExternal.patch
>
>
> When fetching pages from specific domains we have various options, e.g. use
> urlfilters, set the above property to true before injecting urls into the
> webdb etc. However, with the former, it is recognised that complex regex can
> slow down processing, and with the latter it means we disregard a number of
> urls which could potentially become useful in the future.
> Unfortunately there is no way to record external links encountered for future
> processing, although the wiki suggests that a very small patch to the
> generator code can allow you to log these links to hadoop.log. Although this
> is better, a more robust storage mechanism would be preferred. This may tie
> in with custom counters we've already specified or may require new counters
> to be implemented.





[jira] [Updated] (NUTCH-1525) Generator to record external links even when db.ignore.external.links set to true

2014-02-14 Thread Dmitry Cherniachenko (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1525?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dmitry Cherniachenko updated NUTCH-1525:


Attachment: nutch-logExternal.patch

Attached the patch for Nutch 1.7

With it applied, you can add the following to log4j.properties:
{code}
log4j.logger.org.apache.nutch.parse.ParseOutputFormat.externalLinks=INFO,extlinks

log4j.appender.extlinks=org.apache.log4j.DailyRollingFileAppender
log4j.appender.extlinks.File=${hadoop.log.dir}/external-links.log
log4j.appender.extlinks.DatePattern=.yyyy-MM-dd
log4j.appender.extlinks.layout=org.apache.log4j.PatternLayout
log4j.appender.extlinks.layout.ConversionPattern=%m%n
{code}

All the ignored external links will then be logged cleanly to
external-links.log.

> Generator to record external links even when  db.ignore.external.links set to 
> true
> --
>
> Key: NUTCH-1525
> URL: https://issues.apache.org/jira/browse/NUTCH-1525
> Project: Nutch
> Issue Type: Improvement
> Components: generator
> Reporter: Lewis John McGibbney
> Priority: Minor
> Fix For: 2.4
>
> Attachments: nutch-logExternal.patch
>
>
> When fetching pages from specific domains we have various options, e.g. use
> urlfilters, set the above property to true before injecting urls into the
> webdb etc. However, with the former, it is recognised that complex regex can
> slow down processing, and with the latter it means we disregard a number of
> urls which could potentially become useful in the future.
> Unfortunately there is no way to record external links encountered for future
> processing, although the wiki suggests that a very small patch to the
> generator code can allow you to log these links to hadoop.log. Although this
> is better, a more robust storage mechanism would be preferred. This may tie
> in with custom counters we've already specified or may require new counters
> to be implemented.


