;-)

On 4 May 2011 16:26, Mattmann, Chris A (388J) <[email protected]> wrote:

> Awww, sniff....bye parse-rss!
>
> On May 4, 2011, at 11:20 AM, <[email protected]> <[email protected]> wrote:
>
> > Author: jnioche
> > Date: Wed May  4 15:20:00 2011
> > New Revision: 1099483
> >
> > URL: http://svn.apache.org/viewvc?rev=1099483&view=rev
> > Log:
> > NUTCH-888 : Remove parse-rss
> >
> > Added:
> >    nutch/branches/branch-1.3/src/plugin/parse-tika/sample/rsstest.rss
> >    nutch/branches/branch-1.3/src/plugin/parse-tika/src/test/org/apache/nutch/tika/TestFeedParser.java
> > Removed:
> >    nutch/branches/branch-1.3/src/plugin/parse-rss/
> > Modified:
> >    nutch/branches/branch-1.3/CHANGES.txt
> >    nutch/branches/branch-1.3/conf/parse-plugins.xml
> >    nutch/branches/branch-1.3/src/plugin/build.xml
> >    nutch/branches/branch-1.3/src/plugin/parse-tika/build.xml
> >
> > Modified: nutch/branches/branch-1.3/CHANGES.txt
> > URL: http://svn.apache.org/viewvc/nutch/branches/branch-1.3/CHANGES.txt?rev=1099483&r1=1099482&r2=1099483&view=diff
> > ==============================================================================
> > --- nutch/branches/branch-1.3/CHANGES.txt (original)
> > +++ nutch/branches/branch-1.3/CHANGES.txt Wed May  4 15:20:00 2011
> > @@ -2,6 +2,8 @@ Nutch Change Log
> >
> > Release 1.3 - 4/21/2011
> >
> > +* NUTCH-888 Remove parse-rss and add tests for rss to parse-tika (jnioche)
> > +
> > * NUTCH-991 SolrDedup must issue a commit (markus)
> >
> > * NUTCH 986 SolrDedup fails due to date incorrect format (markus)
> >
> > Modified: nutch/branches/branch-1.3/conf/parse-plugins.xml
> > URL: http://svn.apache.org/viewvc/nutch/branches/branch-1.3/conf/parse-plugins.xml?rev=1099483&r1=1099482&r2=1099483&view=diff
> > ==============================================================================
> > --- nutch/branches/branch-1.3/conf/parse-plugins.xml (original)
> > +++ nutch/branches/branch-1.3/conf/parse-plugins.xml Wed May  4 15:20:00 2011
> > @@ -27,9 +27,9 @@
> >       <mimeType name="*">
> >         <plugin id="parse-tika" />
> >       </mimeType>
> > -
> > +
> >       <mimeType name="application/rss+xml">
> > -         <plugin id="parse-rss" />
> > +         <plugin id="parse-tika" />
> >           <plugin id="feed" />
> >       </mimeType>
> >
> > @@ -65,7 +65,6 @@
> >
> >       <mimeType name="text/xml">
> >               <plugin id="parse-tika" />
> > -             <plugin id="parse-rss" />
> >               <plugin id="feed" />
> >       </mimeType>
> >
> > @@ -88,8 +87,6 @@
> >               <alias name="parse-html"
> >                       extension-id="org.apache.nutch.parse.html.HtmlParser" />
> >               <alias name="parse-js" extension-id="JSParser" />
> > -             <alias name="parse-rss"
> > -                     extension-id="org.apache.nutch.parse.rss.RSSParser" />
> >               <alias name="feed"
> >                       extension-id="org.apache.nutch.parse.feed.FeedParser" />
> >               <alias name="parse-swf"
> >
> > Modified: nutch/branches/branch-1.3/src/plugin/build.xml
> > URL: http://svn.apache.org/viewvc/nutch/branches/branch-1.3/src/plugin/build.xml?rev=1099483&r1=1099482&r2=1099483&view=diff
> > ==============================================================================
> > --- nutch/branches/branch-1.3/src/plugin/build.xml (original)
> > +++ nutch/branches/branch-1.3/src/plugin/build.xml Wed May  4 15:20:00 2011
> > @@ -45,7 +45,6 @@
> >      <ant dir="parse-ext" target="deploy"/>
> >      <ant dir="parse-js" target="deploy"/>
> >      <ant dir="parse-html" target="deploy"/>
> > -     <ant dir="parse-rss" target="deploy"/>
> >      <ant dir="parse-swf" target="deploy"/>
> >      <ant dir="parse-tika" target="deploy"/>
> >      <ant dir="parse-zip" target="deploy"/>
> > @@ -77,7 +76,6 @@
> >      <ant dir="protocol-file" target="test"/>
> >      <ant dir="protocol-httpclient" target="test"/>
> >      <!--ant dir="parse-ext" target="test"/-->
> > -     <ant dir="parse-rss" target="test"/>
> >      <ant dir="feed" target="test"/>
> >      <ant dir="parse-html" target="test"/>
> >      <ant dir="parse-swf" target="test"/>
> > @@ -119,7 +117,6 @@
> >     <ant dir="parse-ext" target="clean"/>
> >     <ant dir="parse-js" target="clean"/>
> >     <ant dir="parse-html" target="clean"/>
> > -    <ant dir="parse-rss" target="clean"/>
> >     <ant dir="parse-swf" target="clean"/>
> >     <ant dir="parse-tika" target="clean"/>
> >     <ant dir="parse-zip" target="clean"/>
> >
> > Modified: nutch/branches/branch-1.3/src/plugin/parse-tika/build.xml
> > URL: http://svn.apache.org/viewvc/nutch/branches/branch-1.3/src/plugin/parse-tika/build.xml?rev=1099483&r1=1099482&r2=1099483&view=diff
> > ==============================================================================
> > --- nutch/branches/branch-1.3/src/plugin/parse-tika/build.xml (original)
> > +++ nutch/branches/branch-1.3/src/plugin/parse-tika/build.xml Wed May  4 15:20:00 2011
> > @@ -29,6 +29,7 @@
> >   <mkdir dir="${build.test}/data"/>
> >   <copy todir="${build.test}/data">
> >     <fileset dir="sample">
> > +      <include name="*.rss"/>
> >       <include name="*.rtf"/>
> >       <include name="*.pdf"/>
> >       <include name="ootest.*"/>
> >
> > Added: nutch/branches/branch-1.3/src/plugin/parse-tika/sample/rsstest.rss
> > URL: http://svn.apache.org/viewvc/nutch/branches/branch-1.3/src/plugin/parse-tika/sample/rsstest.rss?rev=1099483&view=auto
> > ==============================================================================
> > --- nutch/branches/branch-1.3/src/plugin/parse-tika/sample/rsstest.rss (added)
> > +++ nutch/branches/branch-1.3/src/plugin/parse-tika/sample/rsstest.rss Wed May  4 15:20:00 2011
> > @@ -0,0 +1,37 @@
> > +<?xml version="1.0" encoding="ISO-8859-1" ?>
> > +<!--
> > +     Licensed to the Apache Software Foundation (ASF) under one or more
> > +     contributor license agreements.  See the NOTICE file distributed with
> > +     this work for additional information regarding copyright ownership.
> > +     The ASF licenses this file to You under the Apache License, Version 2.0
> > +     (the "License"); you may not use this file except in compliance with
> > +     the License.  You may obtain a copy of the License at
> > +
> > +     http://www.apache.org/licenses/LICENSE-2.0
> > +
> > +     Unless required by applicable law or agreed to in writing, software
> > +     distributed under the License is distributed on an "AS IS" BASIS,
> > +     WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
> > +     See the License for the specific language governing permissions and
> > +     limitations under the License.
> > +-->
> > +<rss version="0.91">
> > +    <channel>
> > +      <title>TestChannel</title>
> > +      <link>http://test.channel.com/</link>
> > +      <description>Sample RSS File for Junit test</description>
> > +      <language>en-us</language>
> > +
> > +      <item>
> > +        <title>Home Page of Chris Mattmann</title>
> > +        <link>http://www-scf.usc.edu/~mattmann/</link>
> > +        <description>Chris Mattmann's home page</description>
> > +      </item>
> > +
> > +      <item>
> > +        <title>Awesome Open Source Search Engine</title>
> > +        <link>http://www.nutch.org/</link>
> > +        <description>Yup, that's what it is</description>
> > +      </item>
> > +   </channel>
> > +</rss>
> >
> > Added: nutch/branches/branch-1.3/src/plugin/parse-tika/src/test/org/apache/nutch/tika/TestFeedParser.java
> > URL: http://svn.apache.org/viewvc/nutch/branches/branch-1.3/src/plugin/parse-tika/src/test/org/apache/nutch/tika/TestFeedParser.java?rev=1099483&view=auto
> > ==============================================================================
> > --- nutch/branches/branch-1.3/src/plugin/parse-tika/src/test/org/apache/nutch/tika/TestFeedParser.java (added)
> > +++ nutch/branches/branch-1.3/src/plugin/parse-tika/src/test/org/apache/nutch/tika/TestFeedParser.java Wed May  4 15:20:00 2011
> > @@ -0,0 +1,130 @@
> > +/**
> > + * Licensed to the Apache Software Foundation (ASF) under one or more
> > + * contributor license agreements.  See the NOTICE file distributed with
> > + * this work for additional information regarding copyright ownership.
> > + * The ASF licenses this file to You under the Apache License, Version 2.0
> > + * (the "License"); you may not use this file except in compliance with
> > + * the License.  You may obtain a copy of the License at
> > + *
> > + *     http://www.apache.org/licenses/LICENSE-2.0
> > + *
> > + * Unless required by applicable law or agreed to in writing, software
> > + * distributed under the License is distributed on an "AS IS" BASIS,
> > + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
> > + * See the License for the specific language governing permissions and
> > + * limitations under the License.
> > + */
> > +
> > +package org.apache.nutch.tika;
> > +
> > +import junit.framework.TestCase;
> > +
> > +import org.apache.commons.logging.Log;
> > +import org.apache.commons.logging.LogFactory;
> > +import org.apache.hadoop.conf.Configuration;
> > +import org.apache.hadoop.io.Text;
> > +import org.apache.nutch.crawl.CrawlDatum;
> > +import org.apache.nutch.parse.Outlink;
> > +import org.apache.nutch.parse.Parse;
> > +import org.apache.nutch.parse.ParseData;
> > +import org.apache.nutch.parse.ParseException;
> > +import org.apache.nutch.parse.ParseUtil;
> > +import org.apache.nutch.parse.tika.TikaParser;
> > +import org.apache.nutch.protocol.Content;
> > +import org.apache.nutch.protocol.Protocol;
> > +import org.apache.nutch.protocol.ProtocolException;
> > +import org.apache.nutch.protocol.ProtocolFactory;
> > +import org.apache.nutch.util.NutchConfiguration;
> > +
> > +/**
> > + *
> > + * @author mattmann / jnioche
> > + *
> > + *         Test Suite for the RSS feeds with the {@link TikaParser}.
> > + *
> > + */
> > +public class TestFeedParser extends TestCase {
> > +
> > +     private String fileSeparator = System.getProperty("file.separator");
> > +
> > +     // This system property is defined in ./src/plugin/build-plugin.xml
> > +     private String sampleDir = System.getProperty("test.data", ".");
> > +
> > +     private String[] sampleFiles = { "rsstest.rss" };
> > +
> > +     public static final Log LOG = LogFactory.getLog(TestFeedParser.class
> > +                     .getName());
> > +
> > +     /**
> > +      * Default Constructor.
> > +      *
> > +      * @param name
> > +      *            The name of this {@link TestCase}.
> > +      */
> > +     public TestFeedParser(String name) {
> > +             super(name);
> > +     }
> > +
> > +     /**
> > +      * <p>
> > +      * The test method: tests out the following 2 asserts:
> > +      * </p>
> > +      *
> > +      * <ul>
> > +      * <li>There are 2 outlinks read from the sample rss file</li>
> > +      * <li>The 2 outlinks read are in fact the correct outlinks from the sample
> > +      * file</li>
> > +      * </ul>
> > +      */
> > +     public void testIt() throws ProtocolException, ParseException {
> > +             String urlString;
> > +             Protocol protocol;
> > +             Content content;
> > +             Parse parse;
> > +
> > +             Configuration conf = NutchConfiguration.create();
> > +             for (int i = 0; i < sampleFiles.length; i++) {
> > +                     urlString = "file:" + sampleDir + fileSeparator + sampleFiles[i];
> > +
> > +                     protocol = new ProtocolFactory(conf).getProtocol(urlString);
> > +                     content = protocol.getProtocolOutput(new Text(urlString),
> > +                                     new CrawlDatum()).getContent();
> > +                     parse = new ParseUtil(conf).parseByExtensionId("parse-tika",
> > +                                     content).get(content.getUrl());
> > +
> > +                     // check that there are 2 outlinks:
> > +                     // unlike the original parse-rss
> > +                     // tika ignores the URL and description of the channel
> > +
> > +                     // http://test.channel.com
> > +                     // http://www-scf.usc.edu/~mattmann/
> > +                     // http://www.nutch.org
> > +
> > +                     ParseData theParseData = parse.getData();
> > +
> > +                     Outlink[] theOutlinks = theParseData.getOutlinks();
> > +
> > +                     assertTrue("There aren't 2 outlinks read!",
> > +                                     theOutlinks.length == 2);
> > +
> > +                     // now check to make sure that those are the two outlinks
> > +                     boolean hasLink1 = false, hasLink2 = false;
> > +
> > +                     for (int j = 0; j < theOutlinks.length; j++) {
> > +                             if (theOutlinks[j].getToUrl().equals(
> > +                                             "http://www-scf.usc.edu/~mattmann/")) {
> > +                                     hasLink1 = true;
> > +                             }
> > +
> > +                             if (theOutlinks[j].getToUrl().equals("http://www.nutch.org/")) {
> > +                                     hasLink2 = true;
> > +                             }
> > +                                     hasLink2 = true;
> > +                             }
> > +                     }
> > +
> > +                     if (!hasLink1 || !hasLink2) {
> > +                             fail("Outlinks read from sample rss file are not correct!");
> > +                     }
> > +             }
> > +     }
> > +
> > +}
> >
> >
>
>
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> Chris Mattmann, Ph.D.
> Senior Computer Scientist
> NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
> Office: 171-266B, Mailstop: 171-246
> Email: [email protected]
> WWW:   http://sunset.usc.edu/~mattmann/
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> Adjunct Assistant Professor, Computer Science Department
> University of Southern California, Los Angeles, CA 90089 USA
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>
>


-- 
Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
