I've changed the code as you suggested, but I get the exception below. Why? And why does it seem to involve the MD5Signature class?


2007-02-05 11:28:38,453 WARN  feedparser.FeedFilter (FeedFilter.java:doDecodeEntities(223)) - Filter encountered unknown entities
2007-02-05 11:28:39,390 INFO  crawl.SignatureFactory (SignatureFactory.java:getSignature(45)) - Using Signature impl: org.apache.nutch.crawl.MD5Signature
2007-02-05 11:28:40,078 WARN  mapred.LocalJobRunner (LocalJobRunner.java:run(120)) - job_f6j55m
java.lang.NullPointerException
   at org.apache.nutch.parse.ParseOutputFormat$1.write(ParseOutputFormat.java:121)
   at org.apache.nutch.fetcher.FetcherOutputFormat$1.write(FetcherOutputFormat.java:87)
   at org.apache.hadoop.mapred.ReduceTask$3.collect(ReduceTask.java:235)
   at org.apache.hadoop.mapred.lib.IdentityReducer.reduce(IdentityReducer.java:39)
   at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:247)
   at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:112)


On 2/3/07, Renaud Richardet <[EMAIL PROTECTED]> wrote:

Gal, Chris, Kauu,

So, if I understand correctly, you need a way to pass information along with the fetches, so that when Nutch fetches a feed entry, the <item> value from the previously fetched feed is available.

This is how I tackled the issue:
- extend Outlink.java to allow creating outlinks with extra metadata, and use this new constructor in your feed parser
- pass the metadata on through ParseOutputFormat.java and Fetcher.java
- retrieve the metadata in HtmlParser.java and use it

This is very tedious: it will bloat the size of your outlinks db, it requires changes to Nutch's core code, and so on. But it is the only way I came up with...
If someone sees a better way, please let me know :-)

Sample code, for Nutch 0.8.x:

Outlink.java
+  public Outlink(String toUrl, String anchor, String entryContents, Configuration conf) throws MalformedURLException {
+      this.toUrl = new UrlNormalizerFactory(conf).getNormalizer().normalize(toUrl);
+      this.anchor = anchor;
+      this.entryContents = entryContents;
+  }
and update the other methods accordingly; a rough sketch follows.
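
Something like this, untested; the Writable plumbing below is my guess at what "the other methods" involve, and the field/accessor names are mine. Defaulting the field to an empty string rather than null keeps the later length() checks safe:

+  private String entryContents = "";
+
+  public String getEntryContents() {
+      return entryContents;
+  }
+
+  // Writable serialization, extended with the new field
+  public void readFields(DataInput in) throws IOException {
+      toUrl = UTF8.readString(in);
+      anchor = UTF8.readString(in);
+      entryContents = UTF8.readString(in);
+  }
+
+  public void write(DataOutput out) throws IOException {
+      UTF8.writeString(out, toUrl);
+      UTF8.writeString(out, anchor);
+      UTF8.writeString(out, entryContents);
+  }

In the feed parser, an entry's outlink would then be created along these lines (the item accessors are hypothetical):

+  outlinks.add(new Outlink(item.getLink(), item.getTitle(), item.getDescription(), getConf()));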

ParseOutputFormat.java, around line 140
+            // set outlink info in metadata
+            String entryContents = links[i].getEntryContents();
+
+            // guard against null: ordinary outlinks may carry no entry contents
+            if (entryContents != null && entryContents.length() > 0) { // it's a feed entry
+                MapWritable meta = new MapWritable();
+                meta.put(new UTF8("entryContents"), new UTF8(entryContents)); // key/value
+                target = new CrawlDatum(CrawlDatum.STATUS_LINKED, interval);
+                target.setMetaData(meta);
+            } else {
+                target = new CrawlDatum(CrawlDatum.STATUS_LINKED, interval); // no meta
+            }

Fetcher.java, around line 266
+      // copy the feed info into the content metadata, if present
+      Writable entryContents = datum.getMetaData().get(new UTF8("entryContents"));
+      if (entryContents != null) { // only set for feed entries
+          metadata.set("entryContents", entryContents.toString());
+      }

HtmlParser.java
    // retrieve the entry metadata that the Fetcher attached to the content
    String entryContents = content.getMetadata().get("entryContents");
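
To make the entry contents searchable, it still has to reach the Lucene index. Below is a hypothetical sketch of an indexing filter; it is not part of the patch above, the class and field names are mine, the Nutch 0.8 IndexingFilter signature is assumed, and it presumes the parser copied "entryContents" into the ParseData metadata:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.UTF8;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.nutch.crawl.CrawlDatum;
import org.apache.nutch.crawl.Inlinks;
import org.apache.nutch.indexer.IndexingException;
import org.apache.nutch.indexer.IndexingFilter;
import org.apache.nutch.parse.Parse;

public class FeedEntryIndexingFilter implements IndexingFilter {
    private Configuration conf;

    // add entryContents as a stored, tokenized Lucene field when present
    public Document filter(Document doc, Parse parse, UTF8 url,
                           CrawlDatum datum, Inlinks inlinks)
            throws IndexingException {
        String entryContents = parse.getData().getMeta("entryContents");
        if (entryContents != null) {
            doc.add(new Field("entryContents", entryContents,
                              Field.Store.YES, Field.Index.TOKENIZED));
        }
        return doc;
    }

    public void setConf(Configuration conf) { this.conf = conf; }
    public Configuration getConf() { return conf; }
}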

HTH,
Renaud



Gal Nitzan wrote:
> Hi Chris,
>
> I'm sorry I wasn't clear enough. What I mean is that in the current implementation:
>
> 1. The RSS page (channels, items) ends up as one Lucene document in the index.
> 2. Indeed, the links are extracted, and each <item> link will be fetched in the next fetch round as a separate page, ending up as its own Lucene document.
>
> IMHO the data that is needed, i.e. the data that will be fetched in the next fetch process, is already available in the <item> element. Each <item> element represents one web resource, so there is no reason to go to the server and re-fetch that resource.
>
> Another issue that arises with RSS feeds is that once the feed page is fetched, you cannot re-fetch it until its "time to fetch" has expired, and a feed's TTL is usually very short. Since, for now, all pages in Nutch are created equal :) it is one more thing to think about.
>
> HTH,
>
> Gal.
>
> -----Original Message-----
> From: Chris Mattmann [mailto:[EMAIL PROTECTED]]
> Sent: Thursday, February 01, 2007 7:01 PM
> To: nutch-dev@lucene.apache.org
> Subject: Re: RSS-fecter and index individul-how can i realize this function
>
> Hi Gal, et al.,
>
>   I'd like to be explicit when we talk about what the issue with the RSS parsing plugin is here; I think we have had conversations similar to this before, and it seems that we keep talking around each other. I'd like to get to the heart of this matter so that the issue (if there is an actual one) gets addressed ;)
>
>   Okay, so you mention below that the thing you see missing from the current RSS parsing plugin is the ability to store data in the CrawlDatum and parse "it" in the next fetch phase. Well, there are two options for what you refer to as "it":
>
>  1. If you're talking about the RSS file, then in fact it is parsed, and its data is stored in the CrawlDatum, akin to any other form of content that is fetched, parsed, and indexed.
>
>  2. If you're talking about the item links within the RSS file, they too are parsed (eventually), and their data is stored in the CrawlDatum, akin to any other form of content that is fetched, parsed, and indexed. This is accomplished by adding the RSS items as Outlinks when the RSS file is parsed: in this fashion, we go after all of the links in the RSS file and make sure that we index their content as well.
>
> Thus, if you had an RSS file R that contained links to a PDF file A and an HTML page P, then not only would R get fetched, parsed, and indexed, but so would A and P, because they are item links within R. Queries that match R (the physical RSS file) would then also match things such as P and A, and all three would be capable of being returned by a Nutch query. Does this make sense? Is this the issue that you're talking about? Am I nuts? ;)
>
> Cheers,
>   Chris
>
>
>
> On 1/31/07 10:40 PM, "Gal Nitzan" <[EMAIL PROTECTED]> wrote:
>
>
>> Hi,
>>
>> Many sites provide RSS feeds for several reasons, usually to save bandwidth, to give users concentrated data, and so forth.
>>
>> Some of the RSS files supplied by sites are created specially for search engines, where each RSS "item" represents a web page on the site.
>>
>> IMHO the only thing "missing" in the parse-rss plugin is storing the data in the CrawlDatum and "parsing" it in the next fetch phase. Maybe adding a new flag to CrawlDatum that would mark the URL as "parsable", not "fetchable"?
>>
>> Just my two cents...
>>
>> Gal.
>>
>> -----Original Message-----
>> From: Chris Mattmann [mailto:[EMAIL PROTECTED]]
>> Sent: Wednesday, January 31, 2007 8:44 AM
>> To: nutch-dev@lucene.apache.org
>> Subject: Re: RSS-fecter and index individul-how can i realize this function
>>
>> Hi there,
>>
>>   With the explanation that you give below, it seems like parse-rss as it exists would address what you are trying to do. parse-rss parses an RSS channel as a set of items and indexes overall metadata about the RSS file, including parse text and index data, but it also adds each item's URL (in the channel) as an Outlink, so that Nutch will process those pieces of content as well. The only thing you suggest below that parse-rss currently doesn't do is allow you to associate the metadata fields category: and author: with the item Outlink...
>>
>> Cheers,
>>   Chris
>>
>>
>>
>> On 1/30/07 7:30 PM, "kauu" <[EMAIL PROTECTED]> wrote:
>>
>>
>>> Thanks for your reply. Maybe I didn't explain it clearly. I want to index each item as an individual page, so that when I search for something, for example "nutch-open source", Nutch returns a hit which contains:
>>>
>>>    title: nutch-open source
>>>    description: nutch nutch nutch ....nutch nutch
>>>    url: http://lucene.apache.org/nutch
>>>    category: news
>>>    author: kauu
>>>
>>> So, can the parse-rss plugin satisfy what I need?
>>>
>>> <item>
>>>     <title>nutch--open source</title>
>>>     <description>
>>>         nutch nutch nutch ....nutch nutch
>>>     </description>
>>>     <link>http://lucene.apache.org/nutch</link>
>>>     <category>news</category>
>>>     <author>kauu</author>
>>> </item>
>>>
>>> On 1/31/07, Chris Mattmann <[EMAIL PROTECTED]> wrote:
>>>
>>> Hi there,
>>>
>>> I could most likely be of assistance, if you gave me some more information. For instance: I'm wondering if the use case you describe below is already supported by the current RSS parse plugin?
>>>
>>> The current RSS parser, parse-rss, does in fact index individual items that are pointed to by an RSS document. The items are added as Nutch Outlinks, and added to the overall queue of URLs to fetch. Doesn't this satisfy what you mention below? Or am I missing something?
>>>
>>> Cheers,
>>>   Chris
>>>
>>> On 1/30/07 6:01 PM, "kauu" <[EMAIL PROTECTED]> wrote:
>>>
>>>
>>>> Hi folks:
>>>>
>>>> What I want to do is to separate an RSS file into several pages, just as has been discussed before: fetch an RSS page and index it as several different documents, so the searcher can get each item's info as an individual hit.
>>>>
>>>> My idea is to create a protocol that fetches the RSS page and stores it as several pages, each containing just one ITEM tag. But the unique key is the URL, so how can I store them with the ITEM's link tag as the unique key for each document?
>>>>
>>>> So my question is how to realize this function in Nutch 0.8.x. I've checked the code of the protocol-http plugin, but I can't find the place where a page is stored as a document. I want to split the RSS page into several pages before it is stored, so that it becomes several documents rather than one.
>>>>
>>>> Can anyone give me some hints? Any reply will be appreciated!
>>>>
>>>> ITEM's structure:
>>>>
>>>> <item>
>>>>     <title>Europe's belated snowstorm strikes back, delaying flights and disrupting traffic (photos)</title>
>>>>     <description>A snowstorm swept across Europe, causing repeated flight delays. On January 24, several airliners waited at Stuttgart airport in Germany to have ice and snow removed from their fuselages, and workers cleared snow from a runway at Munich airport in southern Germany. The belated storm reportedly swept across central...</description>
>>>>     <link>http://news.sohu.com/20070125/n247833568.shtml</link>
>>>>     <category>Sohu focus photo news</category>
>>>>     <author>[EMAIL PROTECTED]</author>
>>>>     <pubDate>Thu, 25 Jan 2007 11:29:11 +0800</pubDate>
>>>>     <comments>http://comment.news.sohu.com/comment/topic.jsp?id=247833847</comments>
>>>> </item>
>
> ______________________________________________
> Chris A. Mattmann
> [EMAIL PROTECTED]
> Staff Member
> Modeling and Data Management Systems Section (387)
> Data Management Systems and Technologies Group
>
> _________________________________________________
> Jet Propulsion Laboratory            Pasadena, CA
> Office: 171-266B                        Mailstop:  171-246
> _______________________________________________________
>
> Disclaimer:  The opinions presented within are my own and do not reflect
> those of either NASA, JPL, or the California Institute of Technology.
>


--
renaud richardet                           +1 617 230 9112
renaud <at> oslutions.com         http://www.oslutions.com




--
www.babatu.com
