date:20110406

[jira] [Commented] (NUTCH-977) SolrMappingReader uses hardcoded configuration parameter name for mapping file

2011-04-06 Thread Markus Jelsma (JIRA)


[ 
https://issues.apache.org/jira/browse/NUTCH-977?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13016324#comment-13016324
 ] 

Markus Jelsma commented on NUTCH-977:
-

Any objections, devs? Everything is working fine with these patches.

 SolrMappingReader uses hardcoded configuration parameter name for mapping file
 --

 Key: NUTCH-977
 URL: https://issues.apache.org/jira/browse/NUTCH-977
 Project: Nutch
  Issue Type: Bug
Affects Versions: 1.2, 1.3, 2.0
Reporter: Markus Jelsma
Assignee: Markus Jelsma
 Fix For: 1.3, 2.0

 Attachments: NUTCH-977-1.3.patch, NUTCH-977-trunk.patch


 Because the SolrMappingReader uses a hard coded value for the name of the 
 mapping file configuration parameter it actually works. It should rely on 
 SolrConstants instead of using a hard coded value.
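
A minimal sketch of the difference this issue describes, not the actual Nutch code. It uses `java.util.Properties` as a stand-in for the Hadoop configuration object, and the class `SolrConstantsSketch`, the key `solrindex.mapping.file`, and the default file name are assumptions made for illustration only:

```java
import java.util.Properties;

public class MappingFileParamSketch {

    // Hypothetical stand-in for SolrConstants; real names and values may differ.
    static final class SolrConstantsSketch {
        static final String SOLR_PREFIX = "solrindex.";
        static final String MAPPING_FILE = SOLR_PREFIX + "mapping.file";
    }

    // Hard-coded variant: the parameter name is repeated as a literal,
    // so it silently diverges if the constant is ever changed.
    public static String mappingFileHardCoded(Properties conf) {
        return conf.getProperty("solrindex.mapping.file", "solrindex-mapping.xml");
    }

    // Constant-based variant: the parameter name has a single source of truth.
    public static String mappingFileFromConstant(Properties conf) {
        return conf.getProperty(SolrConstantsSketch.MAPPING_FILE, "solrindex-mapping.xml");
    }

    public static void main(String[] args) {
        Properties conf = new Properties();
        conf.setProperty("solrindex.mapping.file", "my-mapping.xml");
        System.out.println(mappingFileHardCoded(conf));
        System.out.println(mappingFileFromConstant(conf));
    }
}
```

Both lookups agree only as long as the literal and the constant resolve to the same key, which is why relying on the constant alone is the safer choice.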

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (NUTCH-976) SolrIndex constants in wrong namespace (or prefix)

2011-04-06 Thread Markus Jelsma (JIRA)


[ 
https://issues.apache.org/jira/browse/NUTCH-976?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13016323#comment-13016323
 ] 

Markus Jelsma commented on NUTCH-976:
-

Any objections, devs? Everything is working fine with these patches.

 SolrIndex constants in wrong namespace (or prefix)
 --

 Key: NUTCH-976
 URL: https://issues.apache.org/jira/browse/NUTCH-976
 Project: Nutch
  Issue Type: Bug
  Components: indexer
Affects Versions: 1.2, 1.3, 2.0
Reporter: Markus Jelsma
Assignee: Markus Jelsma
 Fix For: 1.3, 2.0

 Attachments: NUTCH-976-1.3-trunk.patch


 The shipped nutch-default.xml configuration file uses solrindex. as namespace 
 for configuration parameters but the namespace (or prefix) in SolrConstants 
 is solr instead. It should be solrindex.
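
To illustrate why the prefix mismatch matters, here is a small sketch under stated assumptions: a plain `Map` stands in for the loaded configuration, and the parameter name `mapping.file` is only an example. A key built with the `solr.` prefix simply misses a value that was configured under `solrindex.`:

```java
import java.util.HashMap;
import java.util.Map;

public class PrefixMismatchSketch {

    // Builds the full key from prefix + name and looks it up.
    // Returns null when no such key is configured.
    public static String lookup(Map<String, String> conf, String prefix, String name) {
        return conf.get(prefix + name);
    }

    public static void main(String[] args) {
        // Stand-in for values read from a configuration file (example key only).
        Map<String, String> conf = new HashMap<>();
        conf.put("solrindex.mapping.file", "solrindex-mapping.xml");

        // Mismatched prefix: the configured value is never found.
        System.out.println(lookup(conf, "solr.", "mapping.file"));
        // Matching prefix: the configured value is returned.
        System.out.println(lookup(conf, "solrindex.", "mapping.file"));
    }
}
```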


[jira] [Updated] (NUTCH-978) [GSoC 2011] A Plugin for extracting certain element of a web page on html page parsing.

2011-04-06 Thread Ammar Shadiq (JIRA)

[
https://issues.apache.org/jira/browse/NUTCH-978?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]

Ammar Shadiq updated NUTCH-978:
---

Attachment: [Nutch-GSoC-2011-Proposal]Web_Page_Scrapper_Parser_Plugin.pdf

Proposal for Google Summer of Code 2011
http://www.google-melange.com/gsoc/homepage/google/gsoc2011

haven't found any mentor yet :-(

[GSoC 2011] A Plugin for extracting certain element of a web page on html
page parsing.
---

Key: NUTCH-978
URL: https://issues.apache.org/jira/browse/NUTCH-978
Project: Nutch
Issue Type: New Feature
Components: parser
Affects Versions: 1.2
Environment: Ubuntu Linux 10.10; JDK 1.6; Netbeans 6.9
Reporter: Ammar Shadiq
Labels: gsoc
Fix For: 2.0

Attachments:
[Nutch-GSoC-2011-Proposal]Web_Page_Scrapper_Parser_Plugin.pdf

Original Estimate: 1680h
Remaining Estimate: 1680h

Nutch uses the parse-html plugin to parse web pages. It processes the contents
of a web page by removing HTML tags and components such as JavaScript and CSS,
leaving the extracted text to be stored in the index. By default, Nutch cannot
select specific elements of an HTML page, such as certain tags, certain
content, or some part of the page.
An HTML page has a tree-like XML structure, with HTML tags as its branches and
text as its nodes. These branches and nodes can be extracted using XPath. XPath
lets us select a particular branch or node of an XML document and can therefore
be used to extract specific information and treat it differently based on its
content and the user's requirements. Furthermore, a web domain such as a news
website usually uses the same HTML structure for storing information across
its web pages. Pages with the same structure can be parsed using the same XPath
query to retrieve the same content elements. All of the XPath queries for
selecting various content can be stored in an XPath configuration file.
Nutch is intended for a variety of web sources, and not all of the pages
retrieved from those sources have the same HTML structure, so they have to be
treated differently, each using the correct XPath configuration. The correct
XPath configuration can be selected automatically by using a regex to match
the URL of the web page against the valid URL pattern for that XPath
configuration.
This automatic mechanism allows Nutch users to process various web pages and
get only the information they want, making the index more accurate and its
content more flexible.
The components for this idea have been tested on Nutch 1.2 for selecting
certain elements on various news websites for the purpose of document
clustering. This includes a Configuration Editor application built using the
NetBeans 6.9 Application Framework, though it still needs some debugging.
http://dl.dropbox.com/u/2642087/For_GSoC/for_GSoc.zip
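
The mechanism described above could be sketched roughly as follows. This is not the proposed plugin's code: the class name, URL patterns, and XPath expressions are made up for illustration, and the input is assumed to be well-formed XHTML (real-world HTML would first need to be cleaned into a DOM, for example by the HTML parser that parse-html already uses). Only the standard `java.util.regex` and `javax.xml.xpath` APIs are used:

```java
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Pattern;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.xml.sax.InputSource;

public class XPathSelectionSketch {

    // URL pattern -> XPath query. Stand-in for an "XPath configuration file";
    // both entries are hypothetical examples.
    private static final Map<Pattern, String> CONFIGS = new LinkedHashMap<>();
    static {
        CONFIGS.put(Pattern.compile("^http://news\\.example\\.com/.*"),
                "//div[@class='article']/h1");
        CONFIGS.put(Pattern.compile("^http://blog\\.example\\.org/.*"),
                "//h2[@id='title']");
    }

    // Returns the XPath query of the first configuration whose URL pattern
    // matches, or null if no configuration applies to this URL.
    public static String selectXPath(String url) {
        for (Map.Entry<Pattern, String> entry : CONFIGS.entrySet()) {
            if (entry.getKey().matcher(url).matches()) {
                return entry.getValue();
            }
        }
        return null;
    }

    // Evaluates the selected XPath query against a well-formed XHTML string
    // and returns the text of the first matching node.
    public static String extract(String url, String xhtml) {
        String query = selectXPath(url);
        if (query == null) {
            return null;
        }
        try {
            XPath xpath = XPathFactory.newInstance().newXPath();
            return xpath.evaluate(query, new InputSource(new StringReader(xhtml)));
        } catch (XPathExpressionException e) {
            throw new IllegalStateException("XPath evaluation failed: " + query, e);
        }
    }

    public static void main(String[] args) {
        String page = "<html><body><div class=\"article\">"
                + "<h1>Headline</h1><p>Body text</p></div></body></html>";
        System.out.println(extract("http://news.example.com/story/1", page));
    }
}
```

Keeping the URL patterns in an ordered map means the first matching configuration wins, which is one simple way to resolve overlapping patterns; the proposal itself does not say how such conflicts would be handled.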


[jira] [Updated] (NUTCH-978) [GSoC 2011] A Plugin for extracting certain element of a web page on html page parsing.

2011-04-06 Thread Ammar Shadiq (JIRA)

[
https://issues.apache.org/jira/browse/NUTCH-978?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]

Ammar Shadiq updated NUTCH-978:
---

Priority: Minor (was: Major)

[GSoC 2011] A Plugin for extracting certain element of a web page on html
page parsing.
---

Attachments:
[Nutch-GSoC-2011-Proposal]Web_Page_Scrapper_Parser_Plugin.pdf

Original Estimate: 1680h
Remaining Estimate: 1680h


[jira] [Commented] (NUTCH-967) Upgrade to Tika 0.9

2011-04-06 Thread Gabriele Kahlout (JIRA)


[ 
https://issues.apache.org/jira/browse/NUTCH-967?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13016465#comment-13016465
 ] 

Gabriele Kahlout commented on NUTCH-967:


Julien, why doesn't your patch modify tika-parse plugin.xml to use 
tika-parsers-0.9 instead of tika-parsers-0.7?
Trying to do so I get exception (for both html and pdfs): 

Exception in thread "main" java.io.IOException: Job failed!
	at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1252)
	at org.apache.nutch.parse.ParseSegment.parse(ParseSegment.java:156)
	at org.apache.nutch.parse.ParseSegment.run(ParseSegment.java:177)
	at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
	at org.apache.nutch.parse.ParseSegment.main(ParseSegment.java:163)

It's enough to set it back to 0.7 to have it work. This is not an issue with 
html only but also pdfs.

 Upgrade to Tika 0.9
 ---

 Key: NUTCH-967
 URL: https://issues.apache.org/jira/browse/NUTCH-967
 Project: Nutch
  Issue Type: Task
  Components: parser
Affects Versions: 1.3, 2.0
Reporter: Markus Jelsma
Assignee: Julien Nioche
 Fix For: 1.3, 2.0

 Attachments: NUTCH-967-1.3.patch





[jira] [Updated] (NUTCH-978) [GSoC 2011] A Plugin for extracting certain element of a web page on html page parsing.

2011-04-06 Thread Ammar Shadiq (JIRA)

[
https://issues.apache.org/jira/browse/NUTCH-978?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]

Ammar Shadiq updated NUTCH-978:
---

Attachment: app_screenshoot_url_regex_filter.png
app_screenshoot_source_view.png
app_screenshoot_configuration_result_anchor.png
app_screenshoot_configuration_result.png

[GSoC 2011] A Plugin for extracting certain element of a web page on html
page parsing.
---

Attachments:
[Nutch-GSoC-2011-Proposal]Web_Page_Scrapper_Parser_Plugin.pdf,
app_screenshoot_configuration_result.png,
app_screenshoot_configuration_result_anchor.png,
app_screenshoot_source_view.png, app_screenshoot_url_regex_filter.png

Original Estimate: 1680h
Remaining Estimate: 1680h


Build failed in Jenkins: Nutch-trunk #1449

2011-04-06 Thread Apache Hudson Server

See https://hudson.apache.org/hudson/job/Nutch-trunk/1449/

--
[...truncated 1009 lines...]
A src/plugin/subcollection/src/java/org/apache/nutch/collection
A src/plugin/subcollection/src/java/org/apache/nutch/collection/Subcollection.java
A src/plugin/subcollection/src/java/org/apache/nutch/collection/CollectionManager.java
A src/plugin/subcollection/src/java/org/apache/nutch/collection/package.html
A src/plugin/subcollection/src/java/org/apache/nutch/indexer
A src/plugin/subcollection/src/java/org/apache/nutch/indexer/subcollection
A src/plugin/subcollection/src/java/org/apache/nutch/indexer/subcollection/SubcollectionIndexingFilter.java
A src/plugin/subcollection/README.txt
A src/plugin/subcollection/plugin.xml
A src/plugin/subcollection/build.xml
A src/plugin/index-more
A src/plugin/index-more/ivy.xml
A src/plugin/index-more/src
A src/plugin/index-more/src/test
A src/plugin/index-more/src/test/org
A src/plugin/index-more/src/test/org/apache
A src/plugin/index-more/src/test/org/apache/nutch
A src/plugin/index-more/src/test/org/apache/nutch/indexer
A src/plugin/index-more/src/test/org/apache/nutch/indexer/more
A src/plugin/index-more/src/test/org/apache/nutch/indexer/more/TestMoreIndexingFilter.java
A src/plugin/index-more/src/java
A src/plugin/index-more/src/java/org
A src/plugin/index-more/src/java/org/apache
A src/plugin/index-more/src/java/org/apache/nutch
A src/plugin/index-more/src/java/org/apache/nutch/indexer
A src/plugin/index-more/src/java/org/apache/nutch/indexer/more
A src/plugin/index-more/src/java/org/apache/nutch/indexer/more/MoreIndexingFilter.java
A src/plugin/index-more/src/java/org/apache/nutch/indexer/more/package.html
A src/plugin/index-more/plugin.xml
A src/plugin/index-more/build.xml
AU src/plugin/plugin.dtd
A src/plugin/parse-ext
A src/plugin/parse-ext/ivy.xml
A src/plugin/parse-ext/src
A src/plugin/parse-ext/src/test
A src/plugin/parse-ext/src/test/org
A src/plugin/parse-ext/src/test/org/apache
A src/plugin/parse-ext/src/test/org/apache/nutch
A src/plugin/parse-ext/src/test/org/apache/nutch/parse
A src/plugin/parse-ext/src/test/org/apache/nutch/parse/ext
A src/plugin/parse-ext/src/test/org/apache/nutch/parse/ext/TestExtParser.java
A src/plugin/parse-ext/src/java
A src/plugin/parse-ext/src/java/org
A src/plugin/parse-ext/src/java/org/apache
A src/plugin/parse-ext/src/java/org/apache/nutch
A src/plugin/parse-ext/src/java/org/apache/nutch/parse
A src/plugin/parse-ext/src/java/org/apache/nutch/parse/ext
A src/plugin/parse-ext/src/java/org/apache/nutch/parse/ext/ExtParser.java
A src/plugin/parse-ext/plugin.xml
A src/plugin/parse-ext/build.xml
A src/plugin/parse-ext/command
A src/plugin/urlnormalizer-pass
A src/plugin/urlnormalizer-pass/ivy.xml
A src/plugin/urlnormalizer-pass/src
A src/plugin/urlnormalizer-pass/src/test
A src/plugin/urlnormalizer-pass/src/test/org
A src/plugin/urlnormalizer-pass/src/test/org/apache
A src/plugin/urlnormalizer-pass/src/test/org/apache/nutch
A src/plugin/urlnormalizer-pass/src/test/org/apache/nutch/net
A src/plugin/urlnormalizer-pass/src/test/org/apache/nutch/net/urlnormalizer
A src/plugin/urlnormalizer-pass/src/test/org/apache/nutch/net/urlnormalizer/pass
AU src/plugin/urlnormalizer-pass/src/test/org/apache/nutch/net/urlnormalizer/pass/TestPassURLNormalizer.java
A src/plugin/urlnormalizer-pass/src/java
A src/plugin/urlnormalizer-pass/src/java/org
A src/plugin/urlnormalizer-pass/src/java/org/apache
A src/plugin/urlnormalizer-pass/src/java/org/apache/nutch
A src/plugin/urlnormalizer-pass/src/java/org/apache/nutch/net
A src/plugin/urlnormalizer-pass/src/java/org/apache/nutch/net/urlnormalizer
A src/plugin/urlnormalizer-pass/src/java/org/apache/nutch/net/urlnormalizer/pass
AU src/plugin/urlnormalizer-pass/src/java/org/apache/nutch/net/urlnormalizer/pass/PassURLNormalizer.java
AU src/plugin/urlnormalizer-pass/plugin.xml
AU src/plugin/urlnormalizer-pass/build.xml
A src/plugin/parse-html
A src/plugin/parse-html/ivy.xml
A src/plugin/parse-html/lib
A src/plugin/parse-html/lib/tagsoup.LICENSE.txt
A src/plugin/parse-html/src
A src/plugin/parse-html/src/test
A src/plugin/parse-html/src/test/org
A src/plugin/parse-html/src/test/org/apache
A src/plugin/parse-html/src/test/org/apache/nutch
A src/plugin/parse-html/src/test/org/apache/nutch/parse
A
