[jira] [Created] (NUTCH-1288) Generator should not generate filter and not found and denied and gone and permanently moved pages
Generator should not generate filter and not found and denied and gone and permanently moved pages
--
Key: NUTCH-1288
URL: https://issues.apache.org/jira/browse/NUTCH-1288
Project: Nutch
Issue Type: Bug
Components: fetcher, generator
Affects Versions: 1.4
Reporter: behnam nikbakht

The Generator should not generate filtered, not-found, denied, gone, or permanently-moved pages. In the shouldFetch method of AbstractFetchSchedule, the CrawlDatum must be checked against special fetch states such as "not found" so that those URLs are not generated again. We could add a status to CrawlDatum that marks invalid URLs and set this status during fetch.

--
This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (NUTCH-1288) Generator should not generate filter and not found and denied and gone and permanently moved pages
[ https://issues.apache.org/jira/browse/NUTCH-1288?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

behnam nikbakht updated NUTCH-1288:
-----------------------------------
Attachment: NUTCH-1288.patch
[jira] [Resolved] (NUTCH-1288) Generator should not generate filter and not found and denied and gone and permanently moved pages
[ https://issues.apache.org/jira/browse/NUTCH-1288?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Julien Nioche resolved NUTCH-1288.
----------------------------------
Resolution: Invalid

This is not the right way to do it. If you don't want to re-try such pages, then implement a custom fetch schedule - don't hack AbstractFetchSchedule as the patch does. Hardcoding the schedule policy forces people to use Nutch the way you want to use it, which is not a good idea. Moreover, your patch removes useful information about the status of a page and replaces it with a more generic (and dubious) value.
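For anyone who wants the custom-schedule route Julien describes, a minimal sketch could look like the class below. It assumes the Nutch 1.4 fetch-schedule API (AbstractFetchSchedule and the CrawlDatum status constants); the class name is invented, and a real deployment would register it through the db.fetch.schedule.class property.

    import org.apache.hadoop.io.Text;
    import org.apache.nutch.crawl.AbstractFetchSchedule;
    import org.apache.nutch.crawl.CrawlDatum;

    // Hypothetical example class, not shipped with Nutch.
    public class SkipDeadPagesFetchSchedule extends AbstractFetchSchedule {

      @Override
      public boolean shouldFetch(Text url, CrawlDatum datum, long curTime) {
        byte status = datum.getStatus();
        // Never re-generate pages recorded as gone or permanently redirected.
        if (status == CrawlDatum.STATUS_DB_GONE
            || status == CrawlDatum.STATUS_DB_REDIR_PERM) {
          return false;
        }
        // Defer to the default interval-based decision for everything else.
        return super.shouldFetch(url, datum, curTime);
      }
    }

This keeps the page-status information in the CrawlDb intact and leaves the default scheduling policy untouched for users who do want to re-try such pages.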
[jira] [Commented] (NUTCH-1281) tika parser not work properly with unwanted file types that passed from filters in nutch
[ https://issues.apache.org/jira/browse/NUTCH-1281?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13212502#comment-13212502 ]

Julien Nioche commented on NUTCH-1281:
--------------------------------------

Behnam, I suppose that you are seeing this issue when using the Crawl class but not when using a script. The reason for this is that the timeout mechanism prevents the parser from getting locked by files which have been truncated or which put the underlying parser library in a spin. When using the Crawl class, these runaway threads are not cleared; they accumulate and take all the memory that is left. The Crawl class is planned to be replaced by a shell script, which will remove this issue and allow people to modify the process easily (and make the pipeline easier to understand).

Or are you seeing this when using the Parse command in a script? Again, the timeout mechanism should prevent the parser from crashing.

Now, if the aim is to prevent the Tika plugin from processing certain types, a better approach would be to filter the docs prior to parsing based on their mime types, which we can now access from the crawldb metadata. The trouble is that the URLFilters consider only the string of a URL and not any metadata. We could change the API of URLFilters? What other metadata would we take into account for filtering? Another approach would be to filter based on the content type in ParseUtil - so that it is used not only for Tika but for any other parser - and have a blacklist of mime types that would not be parsed. Any thoughts?

tika parser not work properly with unwanted file types that passed from filters in nutch
--
Key: NUTCH-1281
URL: https://issues.apache.org/jira/browse/NUTCH-1281
Project: Nutch
Issue Type: Improvement
Components: parser
Reporter: behnam nikbakht

When parse-plugins.xml contains this mapping:

  <mimeType name="*">
    <plugin id="parse-tika" />
  </mimeType>

all files that pass the URL filters are referred to Tika, but for some file types, like .flv, the Tika parser hangs and causes the parse job to fail. If such files pass regex-urlfilter and the other filters, the parse job fails. For this problem I suggest adding a property listing valid file types, and using code like this in TikaParser.java:

  public ParseResult getParse(Content content) {
    String mimeType = content.getContentType();
+   String[] validTypes = new String[] { "application/pdf", "application/x-tika-msoffice",
+       "application/x-tika-ooxml", "application/vnd.oasis.opendocument.text", "text/plain",
+       "application/rtf", "application/rss+xml", "application/x-bzip2", "application/x-gzip",
+       "application/x-javascript", "application/javascript", "text/javascript",
+       "application/x-shockwave-flash", "application/zip", "text/xml", "application/xml" };
+   boolean valid = false;
+   for (int k = 0; k < validTypes.length; k++) {
+     if (validTypes[k].compareTo(mimeType.toLowerCase()) == 0)
+       valid = true;
+   }
+   if (!valid)
+     return new ParseStatus(ParseStatus.NOTPARSED,
+         "Can't parse for unwanted filetype " + mimeType)
+         .getEmptyParseResult(content.getUrl(), getConf());
    URL base;
    ...
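If the ParseUtil blacklist route were taken, the core check might look like this minimal, self-contained sketch. Everything here is hypothetical: the class name, its wiring into ParseUtil, and the example blocked type are illustrations rather than actual Nutch code.

    import java.util.Arrays;
    import java.util.HashSet;
    import java.util.Locale;
    import java.util.Set;

    // Hypothetical helper illustrating a mime-type blacklist check.
    public class MimeTypeBlacklist {

      private final Set<String> blocked;

      public MimeTypeBlacklist(String... types) {
        blocked = new HashSet<String>(Arrays.asList(types));
      }

      // Returns true if a document with this content type should be parsed.
      public boolean shouldParse(String contentType) {
        if (contentType == null) return true;
        // Normalize e.g. "video/x-flv; charset=..." down to the bare mime type.
        String mime = contentType.split(";")[0].trim().toLowerCase(Locale.ROOT);
        return !blocked.contains(mime);
      }

      public static void main(String[] args) {
        MimeTypeBlacklist blacklist = new MimeTypeBlacklist("video/x-flv");
        System.out.println(blacklist.shouldParse("video/x-flv"));              // false
        System.out.println(blacklist.shouldParse("text/html; charset=utf-8")); // true
      }
    }

The integration point would be just before a document is dispatched to a parser: consult shouldParse(content.getContentType()) and record a not-parsed status instead of invoking the parser when it returns false.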
[jira] [Commented] (NUTCH-978) [GSoC 2011] A Plugin for extracting certain element of a web page on html page parsing.
[ https://issues.apache.org/jira/browse/NUTCH-978?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13212523#comment-13212523 ]

Ammar Shadiq commented on NUTCH-978:
------------------------------------

Hi Lewis,

Since the proposal was not accepted, I used my summer to work on my undergrad thesis. I graduated from college recently and my time has freed up, so I'd love to help, and it would be awesome if we could collaborate.

thanks,
Ammar

[GSoC 2011] A Plugin for extracting certain element of a web page on html page parsing.
--
Key: NUTCH-978
URL: https://issues.apache.org/jira/browse/NUTCH-978
Project: Nutch
Issue Type: New Feature
Components: parser
Affects Versions: 1.2
Environment: Ubuntu Linux 10.10; JDK 1.6; Netbeans 6.9
Reporter: Ammar Shadiq
Assignee: Chris A. Mattmann
Priority: Minor
Labels: gsoc2011, mentor
Fix For: nutchgora
Attachments: [Nutch-GSoC-2011-Proposal]Web_Page_Scrapper_Parser_Plugin.pdf, app_guardian_ivory_coast_news_exmpl.png, app_screenshoot_configuration_result.png, app_screenshoot_configuration_result_anchor.png, app_screenshoot_source_view.png, app_screenshoot_url_regex_filter.png, for_GSoc.zip
Original Estimate: 1,680h
Remaining Estimate: 1,680h

Nutch uses the parse-html plugin to parse web pages: it processes the contents of the page by removing HTML tags and components like JavaScript and CSS, leaving the extracted text to be stored in the index. By default Nutch has no capability to select specific atomic elements of an HTML page: certain tags, certain content, some part of the page, etc. An HTML page has a tree-like XML structure, with HTML tags as branches and text as nodes, and those branches and nodes can be extracted using XPath. XPath allows us to select a particular branch or node of an XML document and therefore to extract specific pieces of information and treat them differently based on their content and the user's requirements.

Furthermore, a web domain such as a news website usually uses the same HTML structure for all of its pages, so the same XPath query can retrieve the same content elements across the site. All of the XPath queries for selecting various content can be stored in an XPath configuration file. Nutch targets many different web sources, and pages retrieved from different sources do not share the same HTML structure, so each has to be treated with the correct XPath configuration. The correct configuration can be selected automatically by matching the page URL against the regex of valid URL patterns for that configuration. This mechanism lets Nutch users process varied web pages and keep only the information they want, making the index more accurate and its content more flexible.

The component for this idea has been tested on Nutch 1.2 for selecting certain elements of various news websites for the purpose of document clustering. It includes a Configuration Editor application built with the NetBeans 6.9 Application Framework, though it needs some debugging. http://dl.dropbox.com/u/2642087/For_GSoC/for_GSoc.zip
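To make the XPath mechanism concrete, here is a small self-contained sketch of the extraction step. The sample markup and expressions are invented for illustration; a real plugin would first run the fetched HTML through an HTML-tolerant parser (such as the NekoHTML or TagSoup parsers Nutch's parse-html plugin uses) to obtain a DOM, since javax.xml.parsers only accepts well-formed XML.

    import java.io.StringReader;
    import javax.xml.parsers.DocumentBuilderFactory;
    import javax.xml.xpath.XPath;
    import javax.xml.xpath.XPathConstants;
    import javax.xml.xpath.XPathFactory;
    import org.w3c.dom.Document;
    import org.w3c.dom.NodeList;
    import org.xml.sax.InputSource;

    public class XPathExtractDemo {
      public static void main(String[] args) throws Exception {
        // Trivial well-formed XHTML standing in for a cleaned-up news page.
        String xhtml = "<html><body>"
            + "<h1 class=\"headline\">Some title</h1>"
            + "<div id=\"story\"><p>First paragraph.</p><p>Second.</p></div>"
            + "</body></html>";
        Document doc = DocumentBuilderFactory.newInstance()
            .newDocumentBuilder()
            .parse(new InputSource(new StringReader(xhtml)));

        XPath xpath = XPathFactory.newInstance().newXPath();
        // One expression per content element, as an XPath configuration file would hold.
        String title = xpath.evaluate("//h1[@class='headline']", doc);
        NodeList paras = (NodeList) xpath.evaluate(
            "//div[@id='story']/p", doc, XPathConstants.NODESET);

        System.out.println("title: " + title);
        for (int i = 0; i < paras.getLength(); i++) {
          System.out.println("para: " + paras.item(i).getTextContent());
        }
      }
    }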
[jira] [Commented] (NUTCH-978) [GSoC 2011] A Plugin for extracting certain element of a web page on html page parsing.
[ https://issues.apache.org/jira/browse/NUTCH-978?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13212542#comment-13212542 ]

Ammar Shadiq commented on NUTCH-978:
------------------------------------

I'll send you an email.
[jira] [Commented] (NUTCH-978) [GSoC 2011] A Plugin for extracting certain element of a web page on html page parsing.
[ https://issues.apache.org/jira/browse/NUTCH-978?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13212570#comment-13212570 ]

Chris A. Mattmann commented on NUTCH-978:
-----------------------------------------

Guys, I think it's fine to keep the conversation on list; in fact, I'd favor it, unless there is a specific reason to take it off list?
[jira] [Commented] (NUTCH-978) [GSoC 2011] A Plugin for extracting certain element of a web page on html page parsing.
[ https://issues.apache.org/jira/browse/NUTCH-978?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13212576#comment-13212576 ]

Lewis John McGibbney commented on NUTCH-978:
--------------------------------------------

No bother Chris. So far the questions that have been asked are:

1. Provide a quick run-down of the issue, summarizing all of the above.
2. What were the motivations, purpose and technical challenges encountered whilst working on it?
3. Why did the issue drop away, and what do you think is required to get it back on track and possibly into the codebase?
[jira] [Commented] (NUTCH-978) [GSoC 2011] A Plugin for extracting certain element of a web page on html page parsing.
[ https://issues.apache.org/jira/browse/NUTCH-978?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13212582#comment-13212582 ]

Lewis John McGibbney commented on NUTCH-978:
--------------------------------------------

Replies:

1 & 2. The main motivation for this issue was processing the news documents required for my undergrad thesis on clustering Bahasa Indonesia news text; it needed a preprocessing step to extract the title, news content, date, and related news links separately.

2. The biggest technical challenge for me was processing the web page so that it could be parsed as an XML document and queried with XPath.

3. The issue dropped away because with a small tweak I could get it working for just my thesis requirements. I haven't tested it with web pages other than the ones I used for my thesis, so I don't think it's anywhere near finished yet. And since the proposal was not accepted as a GSoC project, I lost the motivation to continue working on this issue and decided to work on my thesis instead.

related issue: https://issues.apache.org/jira/browse/NUTCH-185
[jira] [Commented] (NUTCH-978) [GSoC 2011] A Plugin for extracting certain element of a web page on html page parsing.
[ https://issues.apache.org/jira/browse/NUTCH-978?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13212584#comment-13212584 ]

Lewis John McGibbney commented on NUTCH-978:
--------------------------------------------

Generally speaking the plugin sounds really useful. The only problem I see is that it is very specific: for something to be integrated into the codebase we usually need to make it specific enough to address some given task fully, in a well-defined and well-justified manner, but also general enough to be used in many different contexts. This increases usability and user feedback as well as engagement.

4. With regards to the biggest technical challenge being the processing of web pages, how far did you get with this? Were you able to process them with enough precision to satisfy your requirements?

5. How were you querying it with XPath? You cannot query with XPath, but instead with XQuery. Do you maybe mean that XPath enabled you to navigate the document and address various parts of it?

6. OK, I understand why it has crumbled slightly, but I think if the code is there it would be a huge waste if we didn't try to revive it, possibly getting it integrated into the codebase, or maybe getting it added as a contrib component without shipping it within the core codebase if the former is not a viable option.

I've had a look at NUTCH-185, but I think we can discard it, as it was addressed a very long time ago and is already integrated into the codebase. I was referring more to Jira issues which are currently open, which we could maybe merge or combine to give this a more general and possibly better-justified argument for inclusion in the codebase... what do you think? Does NUTCH-585 fit this?
[jira] [Commented] (NUTCH-978) [GSoC 2011] A Plugin for extracting certain element of a web page on html page parsing.
[ https://issues.apache.org/jira/browse/NUTCH-978?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13212605#comment-13212605 ]

Ammar Shadiq commented on NUTCH-978:
------------------------------------

4. With regards to the biggest technical challenge being the processing of web pages, how far did you get with this? Were you able to process them with enough precision to satisfy your requirements?

I got it working for my text-clustering algorithm; application screenshots are provided here: http://www.facebook.com/media/set/?set=a.2075564646205.124550.1157621543&type=3&l=7313965254. Yes, it's quite satisfactory.

5. How were you querying it with XPath? You cannot query with XPath, but instead with XQuery. Do you maybe mean that XPath enabled you to navigate the document and address various parts of it?

In my understanding there are three ways to query an XML document: XPath, XQuery and XSLT; I'm sorry if I got that wrong. For navigating the various parts of the page I use a Java HTML parse listener extending HTMLEditorKit.ParserCallback and then display the structure in the editor application (something like Chromium's "inspect element"); this makes the web page structure visible and thus makes the XPath expressions easier to write.

6. OK, I understand why it has crumbled slightly, but I think if the code is there it would be a huge waste if we didn't try to revive it, possibly getting it integrated into the codebase, or maybe getting it added as a contrib component without shipping it within the core codebase if the former is not a viable option.

I totally agree.

As for NUTCH-585, I think the idea is different: NUTCH-585 tries to block certain parts of a page, whereas this idea retrieves only certain parts and, in addition, stores them in specific Lucene fields (I haven't looked into the Solr implementation yet), thus automatically discarding the rest.
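The URL-regex-to-configuration selection described in the issue could be sketched as below. The class, its method names, and the example pattern are hypothetical; in the actual component the mapping would be loaded from the XPath configuration file rather than registered in code.

    import java.util.LinkedHashMap;
    import java.util.Map;
    import java.util.regex.Pattern;

    // Hypothetical illustration of picking an XPath configuration by URL.
    public class XPathConfigSelector {

      // Maps a URL pattern to the XPath used for that site's main content.
      private final Map<Pattern, String> contentXPathByUrl =
          new LinkedHashMap<Pattern, String>();

      public void register(String urlRegex, String contentXPath) {
        contentXPathByUrl.put(Pattern.compile(urlRegex), contentXPath);
      }

      // Returns the XPath for the first matching URL pattern, or null if none match.
      public String select(String url) {
        for (Map.Entry<Pattern, String> e : contentXPathByUrl.entrySet()) {
          if (e.getKey().matcher(url).find()) {
            return e.getValue();
          }
        }
        return null;
      }

      public static void main(String[] args) {
        XPathConfigSelector selector = new XPathConfigSelector();
        selector.register("^https?://www\\.example-news\\.com/.*",
            "//div[@id='article-body']//p");
        System.out.println(selector.select("http://www.example-news.com/2012/some-story"));
      }
    }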
I think I found a bug -- multiple_values_encountered_for_non_multiValued_field_title
So I've been getting the error multiple_values_encountered_for_non_multiValued_field_title every once in a while when I try to run solrindex. I can now say that this is being caused by the index-more plugin (MoreIndexingFilter.java):

  private NutchDocument resetTitle(NutchDocument doc, ParseData data, String url) {
    String contentDisposition = data.getMeta(Metadata.CONTENT_DISPOSITION);
    if (contentDisposition == null)
      return doc;
    for (int i = 0; i < patterns.length; i++) {
      Matcher matcher = patterns[i].matcher(contentDisposition);
      if (matcher.find()) {
        doc.add("title", matcher.group(1));
        break;
      }
    }
    return doc;
  }

The problem here is that in my case this function is not resetting the title but just adding a new one. It seems the original idea was that if CONTENT_DISPOSITION exists, the document will not have had a title set by other plugins (namely index-basic). Unfortunately this is not always the case, as you can see by running this command:

  bin/nutch indexchecker http://www.2modern.com/site/gift-registry.html

What I get (the relevant part) is:

  tstamp : Tue Feb 21 13:18:13 PST 2012
  type : text/html
  type : text
  type : html
  date : Tue Feb 21 13:18:13 PST 2012
  url : http://www.2modern.com/site/gift-registry.html
  content : 2Modern Gift Registry Modern Furniture Lighting items in cart 0 checkout Returning 2Modern cu
  user_ranking : 25.0
  title : 2Modern Gift Registry
  title : gift-registry.html
  plutoz_ranking : 10.0
  categories : Furniture Home
  contentLength : 12924

As you can see, there are 2 titles. I think it would be very easy to fix: just check whether a title already exists before setting the file name as the title:

  if (contentDisposition == null || null != doc.getField("title"))
    return doc;

or, if the substitution must happen in the presence of CONTENT_DISPOSITION, at least remove the old one:

  if (matcher.find()) {
    doc.remove("title");
    doc.add("title", matcher.group(1));
    break;
  }

Now, that being said, the real problem here is why NutchDocument doesn't observe the schema.xml file and always assumes that all fields are multi-valued:

  public void add(String name, Object value) {
    NutchField field = fields.get(name);
    if (field == null) {
      field = new NutchField(value);
      fields.put(name, field);
    } else {
      field.add(value);
    }
  }

--
Kaveh Minooie
www.plutoz.com
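On the closing question, a schema-aware add() is straightforward to sketch. The class below is a hypothetical illustration, not the actual NutchDocument: the set of single-valued field names is hard-coded, whereas a real fix would derive it from the fields declared multiValued="false" in schema.xml.

    import java.util.ArrayList;
    import java.util.Arrays;
    import java.util.HashSet;
    import java.util.LinkedHashMap;
    import java.util.List;
    import java.util.Map;
    import java.util.Set;

    // Hypothetical sketch: add() overwrites single-valued fields, appends otherwise.
    public class SchemaAwareDocumentDemo {

      // Stand-in for what schema.xml declares as multiValued="false".
      private static final Set<String> SINGLE_VALUED =
          new HashSet<String>(Arrays.asList("title", "url", "tstamp"));

      private final Map<String, List<Object>> fields =
          new LinkedHashMap<String, List<Object>>();

      public void add(String name, Object value) {
        List<Object> values = fields.get(name);
        if (values == null) {
          values = new ArrayList<Object>();
          fields.put(name, values);
        } else if (SINGLE_VALUED.contains(name)) {
          values.clear(); // single-valued field: the last value wins
        }
        values.add(value);
      }

      public static void main(String[] args) {
        SchemaAwareDocumentDemo doc = new SchemaAwareDocumentDemo();
        doc.add("title", "2Modern Gift Registry");
        doc.add("title", "gift-registry.html"); // replaces instead of duplicating
        doc.add("type", "text/html");
        doc.add("type", "text");
        System.out.println(doc.fields); // {title=[gift-registry.html], type=[text/html, text]}
      }
    }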
Build failed in Jenkins: Nutch-nutchgora #169
See https://builds.apache.org/job/Nutch-nutchgora/169/
--
[...truncated 2636 lines...]
    [javac] Note: Recompile with -Xlint:unchecked for details.
jar:
      [jar] Building jar: /zonestorage/hudson_solaris/home/hudson/hudson-slave/workspace/Nutch-nutchgora/nutchgora/build/urlfilter-suffix/urlfilter-suffix.jar
deps-test:
deploy:
     [copy] Copying 1 file to /zonestorage/hudson_solaris/home/hudson/hudson-slave/workspace/Nutch-nutchgora/nutchgora/build/plugins/urlfilter-suffix
copy-generated-lib:
     [copy] Copying 1 file to /zonestorage/hudson_solaris/home/hudson/hudson-slave/workspace/Nutch-nutchgora/nutchgora/build/plugins/urlfilter-suffix
init:
    [mkdir] Created dir: /zonestorage/hudson_solaris/home/hudson/hudson-slave/workspace/Nutch-nutchgora/nutchgora/build/urlfilter-validator
    [mkdir] Created dir: /zonestorage/hudson_solaris/home/hudson/hudson-slave/workspace/Nutch-nutchgora/nutchgora/build/urlfilter-validator/classes
    [mkdir] Created dir: /zonestorage/hudson_solaris/home/hudson/hudson-slave/workspace/Nutch-nutchgora/nutchgora/build/urlfilter-validator/test
    [mkdir] Created dir: /zonestorage/hudson_solaris/home/hudson/hudson-slave/workspace/Nutch-nutchgora/nutchgora/build/plugins/urlfilter-validator
init-plugin:
deps-jar:
clean-lib:
resolve-default:
[ivy:resolve] :: loading settings :: file = /zonestorage/hudson_solaris/home/hudson/hudson-slave/workspace/Nutch-nutchgora/nutchgora/ivy/ivysettings.xml
compile:
     [echo] Compiling plugin: urlfilter-validator
    [javac] /zonestorage/hudson_solaris/home/hudson/hudson-slave/workspace/Nutch-nutchgora/nutchgora/src/plugin/build-plugin.xml:117: warning: 'includeantruntime' was not set, defaulting to build.sysclasspath=last; set to false for repeatable builds
    [javac] Compiling 1 source file to /zonestorage/hudson_solaris/home/hudson/hudson-slave/workspace/Nutch-nutchgora/nutchgora/build/urlfilter-validator/classes
jar:
      [jar] Building jar: /zonestorage/hudson_solaris/home/hudson/hudson-slave/workspace/Nutch-nutchgora/nutchgora/build/urlfilter-validator/urlfilter-validator.jar
deps-test:
deploy:
     [copy] Copying 1 file to /zonestorage/hudson_solaris/home/hudson/hudson-slave/workspace/Nutch-nutchgora/nutchgora/build/plugins/urlfilter-validator
copy-generated-lib:
     [copy] Copying 1 file to /zonestorage/hudson_solaris/home/hudson/hudson-slave/workspace/Nutch-nutchgora/nutchgora/build/plugins/urlfilter-validator
init:
    [mkdir] Created dir: /zonestorage/hudson_solaris/home/hudson/hudson-slave/workspace/Nutch-nutchgora/nutchgora/build/urlnormalizer-basic
    [mkdir] Created dir: /zonestorage/hudson_solaris/home/hudson/hudson-slave/workspace/Nutch-nutchgora/nutchgora/build/urlnormalizer-basic/classes
    [mkdir] Created dir: /zonestorage/hudson_solaris/home/hudson/hudson-slave/workspace/Nutch-nutchgora/nutchgora/build/urlnormalizer-basic/test
    [mkdir] Created dir: /zonestorage/hudson_solaris/home/hudson/hudson-slave/workspace/Nutch-nutchgora/nutchgora/build/plugins/urlnormalizer-basic
init-plugin:
deps-jar:
clean-lib:
resolve-default:
[ivy:resolve] :: loading settings :: file = /zonestorage/hudson_solaris/home/hudson/hudson-slave/workspace/Nutch-nutchgora/nutchgora/ivy/ivysettings.xml
compile:
     [echo] Compiling plugin: urlnormalizer-basic
    [javac] /zonestorage/hudson_solaris/home/hudson/hudson-slave/workspace/Nutch-nutchgora/nutchgora/src/plugin/build-plugin.xml:117: warning: 'includeantruntime' was not set, defaulting to build.sysclasspath=last; set to false for repeatable builds
    [javac] Compiling 1 source file to /zonestorage/hudson_solaris/home/hudson/hudson-slave/workspace/Nutch-nutchgora/nutchgora/build/urlnormalizer-basic/classes
jar:
      [jar] Building jar: /zonestorage/hudson_solaris/home/hudson/hudson-slave/workspace/Nutch-nutchgora/nutchgora/build/urlnormalizer-basic/urlnormalizer-basic.jar
deps-test:
deploy:
     [copy] Copying 1 file to /zonestorage/hudson_solaris/home/hudson/hudson-slave/workspace/Nutch-nutchgora/nutchgora/build/plugins/urlnormalizer-basic
copy-generated-lib:
     [copy] Copying 1 file to /zonestorage/hudson_solaris/home/hudson/hudson-slave/workspace/Nutch-nutchgora/nutchgora/build/plugins/urlnormalizer-basic
init:
    [mkdir] Created dir: /zonestorage/hudson_solaris/home/hudson/hudson-slave/workspace/Nutch-nutchgora/nutchgora/build/urlnormalizer-pass
    [mkdir] Created dir: /zonestorage/hudson_solaris/home/hudson/hudson-slave/workspace/Nutch-nutchgora/nutchgora/build/urlnormalizer-pass/classes
    [mkdir] Created dir: /zonestorage/hudson_solaris/home/hudson/hudson-slave/workspace/Nutch-nutchgora/nutchgora/build/urlnormalizer-pass/test
    [mkdir] Created dir: /zonestorage/hudson_solaris/home/hudson/hudson-slave/workspace/Nutch-nutchgora/nutchgora/build/plugins/urlnormalizer-pass
init-plugin:
deps-jar:
clean-lib:
resolve-default:
[ivy:resolve] :: loading settings :: file =
Jenkins build is back to normal : nutch-trunk-maven #161
See https://builds.apache.org/job/nutch-trunk-maven/161/