Came across this issue :0)

https://issues.apache.org/jira/browse/NUTCH-956

which seems to clear up the mystery around this one.

It also reminded me of this recent conversation [0].

I will test and get a JUnit case written before attaching a new patch to
the issue.
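
On point 3 of the quoted message below, here is a minimal sketch of why
the explicit equals() check is likely the safer of the two. It is not
the SolrWriter code itself: the class and method names are hypothetical
stand-ins, and only the two comparisons are taken from the patch. I am
not claiming anything about what mapCopyKey() actually returns, only
that != compares object references while equals() compares content.

```java
final class CopyKeyCheck {

    // Hypothetical stand-in for the pre-patch check: true whenever the two
    // variables point at different String objects, even with equal content.
    static boolean differsByReference(String key, String copyKey) {
        return copyKey != key;
    }

    // Hypothetical stand-in for the patched check: true only when the
    // character content of the two strings differs.
    static boolean differsByContent(String key, String copyKey) {
        return !copyKey.equals(key);
    }

    public static void main(String[] args) {
        String key = "title";
        // Same characters, but deliberately a distinct object.
        String copyKey = new String("title");

        System.out.println(differsByReference(key, copyKey)); // prints "true"
        System.out.println(differsByContent(key, copyKey));   // prints "false"
    }
}
```

With the reference check, a copy key holding the same text in a
different object would be treated as a different field and added a
second time; the content check would skip it.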

[0] http://www.mail-archive.com/user%40nutch.apache.org/msg07272.html

On Sat, Aug 25, 2012 at 1:18 PM, Lewis John Mcgibbney
<[email protected]> wrote:
> Hi,
>
> I've had a random patch lying around one of my desktops for some time.
>
> 1) schema.xml is straightforward enough.
> 2) MoreIndexingFilter.java seems to be an issue of reliability
> (possibly). Maybe the HTTP header content information can be
> unreliable at times? Does anyone have an opinion on this? At the
> moment I am none the wiser but keen to gather views and/or experiences.
> 3) Again in SolrWriter.java this may be an issue of reliability
> (accuracy?) regarding the proposed explicit equals() check instead
> of the arbitrary reference (!=) check. Any thoughts?
>
> I did not produce this patch and can't remember how or why it ended up
> on my desktop! So apologies for the randomness of this one.
>
> Thanks
>
> Lewis
>
>
> Index: conf/schema.xml
> ===================================================================
> --- conf/schema.xml     (revision 1145734)
> +++ conf/schema.xml     (working copy)
> @@ -113,6 +113,8 @@
>          <!-- fields for creativecommons plugin -->
>          <field name="cc" type="string" stored="true" indexed="true"
>              multiValued="true"/>
> +
> +        <field name="tld" type="string" stored="false" indexed="false"/>
>      </fields>
>      <uniqueKey>id</uniqueKey>
>      <defaultSearchField>content</defaultSearchField>
>
> Index: src/plugin/index-more/src/java/org/apache/nutch/indexer/more/MoreIndexingFilter.java
> ===================================================================
> --- src/plugin/index-more/src/java/org/apache/nutch/indexer/more/MoreIndexingFilter.java       (revision 1053817)
> +++ src/plugin/index-more/src/java/org/apache/nutch/indexer/more/MoreIndexingFilter.java       (working copy)
> @@ -172,7 +172,7 @@
>     */
>    private NutchDocument addType(NutchDocument doc, WebPage page, String url) {
>      MimeType mimeType = null;
> -    Utf8 contentType = page.getFromHeaders(new Utf8(HttpHeaders.CONTENT_TYPE));
> +    Utf8 contentType = page.getContentType();
>      if (contentType == null) {
>        // Note by Jerome Charron on 20050415:
>        // Content Type not solved by a previous plugin
> Index: src/java/org/apache/nutch/indexer/solr/SolrWriter.java
> ===================================================================
> --- src/java/org/apache/nutch/indexer/solr/SolrWriter.java      (revision 1053817)
> +++ src/java/org/apache/nutch/indexer/solr/SolrWriter.java      (working copy)
> @@ -56,7 +56,7 @@
>        for (final String val : e.getValue()) {
>          inputDoc.addField(solrMapping.mapKey(e.getKey()), val);
>          String sCopy = solrMapping.mapCopyKey(e.getKey());
> -        if (sCopy != e.getKey()) {
> +        if (! sCopy.equals(e.getKey())) {
>                 inputDoc.addField(sCopy, val);
>          }
>        }
>
> --
> Lewis



-- 
Lewis
