[jira] [Updated] (NUTCH-1726) HeadingsFilter does not find nested nodes

2014-02-12 Thread lufeng (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1726?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

lufeng updated NUTCH-1726:
--

Attachment: NUTCH-1726-trunk-v2.patch

Added a test case to check the HeadingsFilter patch. :)

> HeadingsFilter does not find nested nodes
> -----------------------------------------
>
> Key: NUTCH-1726
> URL: https://issues.apache.org/jira/browse/NUTCH-1726
> Project: Nutch
> Issue Type: Bug
> Affects Versions: 1.7
> Reporter: Markus Jelsma
> Assignee: Markus Jelsma
> Priority: Minor
> Fix For: 1.8
>
> Attachments: NUTCH-1726-trunk-v2.patch, NUTCH-1726-trunk.patch
>
>
> Filter won't find:
> {code}
> apache nutch
> {code}
> The getNodeValue() tries to read data from children but should traverse nodes 
> instead.



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)


Re: [DISCUSS] Release Trunk

2014-02-12 Thread Tejas Patil
Just saw the commits since 1.7 release. Apart from trivial bug fixes, we
have some significant patches since 1.7.
+1 for new release. I would be happy to volunteer / help.

Thanks,
Tejas



On Wed, Feb 12, 2014 at 7:33 AM, Julien Nioche <
lists.digitalpeb...@gmail.com> wrote:

> Hi guys,
>
> At least 2 of the issues that Seb and I had mentioned have now been
> committed. What about releasing 1.8 from trunk? If so, any volunteers?
>
> Julien


[jira] [Updated] (NUTCH-1727) Length of the Tlds

2014-02-12 Thread Sertac TURKEL (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1727?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sertac TURKEL updated NUTCH-1727:
-

Attachment: NUTCH-1727.patch

I had a look at domain-suffix.xml and saw that the longest domain suffix has 
8 characters (.internal). For this reason I picked 8 as the default value and 
prepared a patch. Could you review it?

> Length of the Tlds
> ------------------
>
> Key: NUTCH-1727
> URL: https://issues.apache.org/jira/browse/NUTCH-1727
> Project: Nutch
> Issue Type: Bug
> Reporter: Sertac TURKEL
> Priority: Minor
> Fix For: 2.1
>
> Attachments: NUTCH-1727.patch
>
>
> The maximum length of the TLD should be configurable: there are valid TLDs 
> such as .travel, and the url-validator plugin filters out URLs of this type.





[jira] [Created] (NUTCH-1727) Length of the Tlds

2014-02-12 Thread Sertac TURKEL (JIRA)
Sertac TURKEL created NUTCH-1727:


 Summary: Length of the Tlds
 Key: NUTCH-1727
 URL: https://issues.apache.org/jira/browse/NUTCH-1727
 Project: Nutch
  Issue Type: Bug
Reporter: Sertac TURKEL
Priority: Minor
 Fix For: 2.1


The maximum length of the TLD should be configurable: there are valid TLDs 
such as .travel, and the url-validator plugin filters out URLs of this type.
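To illustrate the request, here is a minimal sketch of a TLD check whose upper length bound is a parameter rather than a constant. It is not the url-validator plugin's actual code: the class name TldLengthCheck, the method isValidHost, and the limit of 4 used in the demo are hypothetical, and the constructor argument stands in for whatever value a patch would read from the Nutch configuration.

```java
import java.util.regex.Pattern;

final class TldLengthCheck {

  // Matches an alphabetic TLD between 2 and maxTldLength characters long.
  private final Pattern tld;

  TldLengthCheck(int maxTldLength) {
    if (maxTldLength < 2) {
      throw new IllegalArgumentException("maxTldLength must be at least 2");
    }
    this.tld = Pattern.compile("[A-Za-z]{2," + maxTldLength + "}");
  }

  /** Returns true if the part after the last dot is an acceptable TLD. */
  boolean isValidHost(String host) {
    int dot = host.lastIndexOf('.');
    if (dot < 0 || dot == host.length() - 1) {
      return false; // no TLD at all
    }
    return tld.matcher(host.substring(dot + 1)).matches();
  }

  public static void main(String[] args) {
    // Hypothetical hard-coded limit of 4: "travel" has 6 letters and is rejected.
    System.out.println(new TldLengthCheck(4).isValidHost("example.travel")); // prints "false"
    // With the limit raised to 8, the same host is accepted.
    System.out.println(new TldLengthCheck(8).isValidHost("example.travel")); // prints "true"
  }
}
```

A limit of 8 would also cover the longest suffix mentioned for domain-suffix.xml (.internal).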





Re: [DISCUSS] Release Trunk

2014-02-12 Thread Julien Nioche
Hi guys,

At least 2 of the issues that Seb and I had mentioned have now been
committed. What about releasing 1.8 from trunk? If so, any volunteers?

Julien


On 2 December 2013 21:02, Sebastian Nagel wrote:

> Hi,
>
> +1 to release soon (this year, or early next year)
>
> > and probably a few others but they could also be done later.
> At least, these should be done before releasing:
> NUTCH-1646 IndexerMapReduce to consider DB status
> NUTCH-1413 Record response time
>
> Sebastian
>
> On 11/28/2013 05:49 PM, Julien Nioche wrote:
> > Hi Lewis
> >
> > We've done quite a few things in 1.x since the previous release (e.g.
> > generic deduplication, removing indexer.solr package, etc...) and the
> > next 2.x release will be after the changes to GORA have been made,
> > tested and used on the Nutch side so that could be quite a while.
> >
> > I am neutral as to whether we should do a 1.x release now. There are
> > some minor issues that we could do in 1.x before the next release like :
> > * https://issues.apache.org/jira/browse/NUTCH-1360
> > * https://issues.apache.org/jira/browse/NUTCH-1676
> > and probably a few others but they could also be done later.
> >
> > Let's hear what others think.
> >
> > Thanks
> >
> > Julien
> >
> >
> >
> >
> > On 28 November 2013 16:34, Lewis John Mcgibbney
> > <lewis.mcgibb...@gmail.com> wrote:
> >
> > Hi Folks,
> > Thread says it all.
> > There are some hot tickets over in Gora right now so I think holding
> > off the next while for a 2.x release would be wise.
> > I can spin the RC for trunk tonight/tomorrow/weekend if we get the
> > thumbs up.
> > Ta
> > Lewis
> >
> > --
> > /Lewis/
> >
> >
> >
> >
> > --
> > *
> > *Open Source Solutions for Text Engineering
> >
> > http://digitalpebble.blogspot.com/
> > http://www.digitalpebble.com
> > http://twitter.com/digitalpebble
>
>


-- 

Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
http://twitter.com/digitalpebble


Is it possible to run Nutch 2.x with httpclient 3 and 4 simultaneously?

2014-02-12 Thread d_k
I'm looking into upgrading the httpclient version used by
protocol-httpclient because there are some fixes in the 4.x branch that I
need. I realized it will be impossible to do without breaking gora 3
support of hbase 90.x, because the latter still uses httpclient 3. So I was
wondering how bad it would be to upgrade the dependencies to httpclient 4
and change protocol-httpclient to use the version 4 API without touching any
gora/hbase code. Since the package name has changed, the new library should
not affect code that doesn't import the new package, so could the two
versions be loaded and live side by side?

From the looks of it, having two versions of the same library sounds like a
bad idea, but I'll be happy to hear an opinion on the subject.
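The point about package names can be checked on a given classpath. The sketch below is only a probe, not a statement about how Nutch, Gora or HBase load their dependencies: it looks up org.apache.commons.httpclient.HttpClient (the 3.x client class) and org.apache.http.client.HttpClient (the 4.x client interface) and reports where each was loaded from. The class name HttpClientProbe and its methods are hypothetical, and finding both classes shows only that the two libraries do not collide by name, not that running both is a good idea.

```java
import java.security.CodeSource;

final class HttpClientProbe {

  /** Returns true if the named class is visible to this class's loader. */
  static boolean isPresent(String className) {
    return locationOf(className) != null;
  }

  /** Returns where the class was loaded from, or null if it is not available. */
  static String locationOf(String className) {
    try {
      // initialize=false: look the class up without running its static initializers.
      Class<?> c = Class.forName(className, false, HttpClientProbe.class.getClassLoader());
      CodeSource src = c.getProtectionDomain().getCodeSource();
      return src == null ? "(no code source, e.g. JDK class)" : String.valueOf(src.getLocation());
    } catch (ClassNotFoundException | LinkageError e) {
      return null;
    }
  }

  public static void main(String[] args) {
    String[] names = {
      "org.apache.commons.httpclient.HttpClient", // Commons HttpClient 3.x
      "org.apache.http.client.HttpClient"         // HttpComponents HttpClient 4.x
    };
    for (String name : names) {
      String location = locationOf(name);
      System.out.println(name + " -> " + (location == null ? "not on classpath" : location));
    }
  }
}
```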


[jira] [Commented] (NUTCH-961) Expose Tika's boilerpipe support

2014-02-12 Thread Matzz (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-961?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13899044#comment-13899044
 ] 

Matzz commented on NUTCH-961:
-

{quote}We don't use it BP anymore {quote}

Will BP integration be abandoned entirely? Are there any plans to use another 
content extractor instead of Boilerpipe?

> Expose Tika's boilerpipe support
> --------------------------------
>
> Key: NUTCH-961
> URL: https://issues.apache.org/jira/browse/NUTCH-961
> Project: Nutch
> Issue Type: New Feature
> Components: parser
> Reporter: Markus Jelsma
> Assignee: Markus Jelsma
> Fix For: 2.3, 1.8
>
> Attachments: BoilerpipeExtractorRepository.java, 
> NUTCH-961-1.3-3.patch, NUTCH-961-1.3-tikaparser.patch, 
> NUTCH-961-1.3-tikaparser1.patch, NUTCH-961-1.4-dombuilder-1.patch, 
> NUTCH-961-1.5-1.patch, NUTCH-961-1.8-1.patch, NUTCH-961-2.1-v1.patch, 
> NUTCH-961-2.1-v2.patch, NUTCH-961v2.patch
>
>
> Tika 0.8 comes with the Boilerpipe content handler which can be used to 
> extract boilerplate content from HTML pages. We should see how we can expose 
> Boilerpipe in the Nutch configuration.





[jira] [Updated] (NUTCH-1726) HeadingsFilter does not find nested nodes

2014-02-12 Thread Markus Jelsma (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1726?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma updated NUTCH-1726:
-

Attachment: NUTCH-1726-trunk.patch

Patch for trunk, fixing the problem.

> HeadingsFilter does not find nested nodes
> -----------------------------------------
>
> Key: NUTCH-1726
> URL: https://issues.apache.org/jira/browse/NUTCH-1726
> Project: Nutch
> Issue Type: Bug
> Affects Versions: 1.7
> Reporter: Markus Jelsma
> Assignee: Markus Jelsma
> Priority: Minor
> Fix For: 1.8
>
> Attachments: NUTCH-1726-trunk.patch
>
>
> Filter won't find:
> {code}
> apache nutch
> {code}
> The getNodeValue() tries to read data from children but should traverse nodes 
> instead.





[jira] [Created] (NUTCH-1726) HeadingsFilter does not find nested nodes

2014-02-12 Thread Markus Jelsma (JIRA)
Markus Jelsma created NUTCH-1726:


 Summary: HeadingsFilter does not find nested nodes
 Key: NUTCH-1726
 URL: https://issues.apache.org/jira/browse/NUTCH-1726
 Project: Nutch
  Issue Type: Bug
Affects Versions: 1.7
Reporter: Markus Jelsma
Assignee: Markus Jelsma
Priority: Minor
 Fix For: 1.8


Filter won't find:
{code}
apache nutch
{code}

The getNodeValue() tries to read data from children but should traverse nodes 
instead.
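The difference between reading a child's value and traversing the nodes can be sketched with the standard org.w3c.dom API. This is not the attached patch: the class NestedHeadingText, its methods, and the sample markup are assumptions made for illustration (the markup in the {code} block above appears to have been lost, so the nesting used here is a guess).

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;

final class NestedHeadingText {

  /** Recursively appends the data of every text node below the given node. */
  static void collectText(Node node, StringBuilder out) {
    if (node.getNodeType() == Node.TEXT_NODE) {
      out.append(node.getNodeValue());
      return;
    }
    for (Node child = node.getFirstChild(); child != null; child = child.getNextSibling()) {
      collectText(child, out);
    }
  }

  /** Returns the trimmed text found anywhere below the given node. */
  static String textOf(Node node) {
    StringBuilder out = new StringBuilder();
    collectText(node, out);
    return out.toString().trim();
  }

  /** Parses a small well-formed snippet; used only for the demo below. */
  static Document parse(String xml) {
    try {
      return DocumentBuilderFactory.newInstance().newDocumentBuilder()
          .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
    } catch (Exception e) {
      throw new IllegalStateException("could not parse sample markup", e);
    }
  }

  public static void main(String[] args) {
    // Hypothetical markup: the heading text is wrapped in nested elements.
    Node heading = parse("<h1><a href=\"#\"><b>apache</b> nutch</a></h1>").getDocumentElement();
    // The first child is the <a> element, and an element's node value is null.
    System.out.println(heading.getFirstChild().getNodeValue()); // prints "null"
    // Walking the whole subtree finds the text at any depth.
    System.out.println(textOf(heading)); // prints "apache nutch"
  }
}
```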





[jira] [Commented] (NUTCH-1718) update description of property http.robots.agent

2014-02-12 Thread Sebastian Nagel (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1718?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13899029#comment-13899029
 ] 

Sebastian Nagel commented on NUTCH-1718:


Hi [~tejasp], +1 to "redefine" {{http.robots.agents}} as "additional agent 
names": it makes things simpler for polite users, who definitely should use 
the same user agent name in the HTTP header and robots.txt.
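One possible reading of "additional agent names" is sketched below. It is not taken from the patch: the class RobotsAgentNames and the method agentNames are hypothetical. The value of http.agent.name always comes first, the names from http.robots.agents are appended, and a literal '*' is dropped so that it cannot match every robots.txt group.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

final class RobotsAgentNames {

  /**
   * Builds the list of agent names to look for in robots.txt:
   * the main agent name first, then any additional names, without
   * duplicates, empty entries or the wildcard '*'.
   */
  static List<String> agentNames(String agentName, String additionalAgents) {
    Set<String> names = new LinkedHashSet<>(); // keeps insertion order
    names.add(agentName.trim());
    if (additionalAgents != null) {
      for (String part : additionalAgents.split(",")) {
        String name = part.trim();
        if (!name.isEmpty() && !name.equals("*")) {
          names.add(name);
        }
      }
    }
    return new ArrayList<>(names);
  }

  public static void main(String[] args) {
    // Old-style value from the property description, including the wildcard.
    System.out.println(agentNames("Blurfl", "BlurflDev,Blurfl,*")); // prints "[Blurfl, BlurflDev]"
  }
}
```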

> update description of property http.robots.agent
> ------------------------------------------------
>
> Key: NUTCH-1718
> URL: https://issues.apache.org/jira/browse/NUTCH-1718
> Project: Nutch
> Issue Type: Bug
> Components: fetcher
> Affects Versions: 1.7, 2.2, 2.2.1
> Reporter: Sebastian Nagel
> Priority: Trivial
> Fix For: 2.3, 1.8
>
> Attachments: NUTCH-1718-trunk.v1.patch
>
>
> The description of property http.robots.agents in nutch-default.xml recommends 
> adding a '*' to the list of agent names. This will cause the same problem as 
> described in NUTCH-1715. The description should be updated, including the 
> "order of precedence", which since NUTCH-1031 is dictated only by the ordering 
> of user agents in robots.txt.
> {code:xml}
> <property>
>   <name>http.robots.agents</name>
>   <value>*</value>
>   <description>The agent strings we'll look for in robots.txt files,
>   comma-separated, in decreasing order of precedence. You should
>   put the value of http.agent.name as the first agent name, and keep the
>   default * at the end of the list. E.g.: BlurflDev,Blurfl,*
>   </description>
> </property>
> {code}





[jira] [Comment Edited] (NUTCH-1662) Indexer Plugin for Solr Cloud

2014-02-12 Thread JIRA

[ 
https://issues.apache.org/jira/browse/NUTCH-1662?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13872195#comment-13872195
 ] 

Yasin Kılınç edited comment on NUTCH-1662 at 2/12/14 9:27 AM:
--

I created an indexer plugin for SolrCloud.


was (Author: icebergx5):
I create indexer plugin of SolrCloud. This patch can apply after 
https://issues.apache.org/jira/browse/NUTCH-1568.

> Indexer Plugin for Solr Cloud
> -----------------------------
>
> Key: NUTCH-1662
> URL: https://issues.apache.org/jira/browse/NUTCH-1662
> Project: Nutch
> Issue Type: Sub-task
> Components: indexer
> Affects Versions: 2.3
> Reporter: Talat UYARER
> Fix For: 2.3
>
> Attachments: NUTCH-1662.patch
>
>
> The main issue's patch uses a Solr HTTP connection, which doesn't support 
> SolrCloud. This plugin supports SolrCloud.





[jira] [Comment Edited] (NUTCH-1662) Indexer Plugin for Solr Cloud

2014-02-12 Thread JIRA

[ 
https://issues.apache.org/jira/browse/NUTCH-1662?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13872195#comment-13872195
 ] 

Yasin Kılınç edited comment on NUTCH-1662 at 2/12/14 9:23 AM:
--

I created an indexer plugin for SolrCloud. This patch can be applied after 
https://issues.apache.org/jira/browse/NUTCH-1568.


was (Author: icebergx5):
I create indexer plugin of SolrCloud. This patch can apply after NUTCH-1655.

> Indexer Plugin for Solr Cloud
> -----------------------------
>
> Key: NUTCH-1662
> URL: https://issues.apache.org/jira/browse/NUTCH-1662
> Project: Nutch
> Issue Type: Sub-task
> Components: indexer
> Affects Versions: 2.3
> Reporter: Talat UYARER
> Fix For: 2.3
>
> Attachments: NUTCH-1662.patch
>
>
> The main issue's patch uses a Solr HTTP connection, which doesn't support 
> SolrCloud. This plugin supports SolrCloud.


