Reading additional metadata field: mtdt:_hr_

2013-08-14 Thread Ahmet Emre Aladağ
Hi,


I added additional mtdt:_hr_ records to HBase externally; they hold scores.

To get the score stored in mtdt:_hr_ from IndexUtil in Nutch 2.1, I'd like to
use:

HostDb hostDb = new HostDb(conf);
Host host = hostDb.getByHostName("http://www.google.com");
host.getFromMetaData(new Utf8("_hr_"));

But it returns null although these records exist in the table. Metadata holds 
only [f, p] keys, not _hr_.

Should I specify this additional metadata key (qualifier) somewhere?

Thanks,




[jira] [Commented] (NUTCH-1294) IndexClean job with solr implementation.

2013-08-14 Thread Lewis John McGibbney (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1294?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13739978#comment-13739978
 ] 

Lewis John McGibbney commented on NUTCH-1294:
-

Sorry, I meant changes.txt.


-- 
*Lewis*


> IndexClean job with solr implementation.
> 
>
> Key: NUTCH-1294
> URL: https://issues.apache.org/jira/browse/NUTCH-1294
> Project: Nutch
> Issue Type: Improvement
> Affects Versions: nutchgora
> Reporter: Dan Rosher
> Priority: Minor
> Fix For: 2.3
>
> Attachments: NUTCH-1294.patch, NUTCH-1294-v2.patch, 
> NUTCH-1294-v3.patch
>
>
> I started by copying/altering the trunk version of SolrClean, though it was 
> inadequate for our needs. We needed to mark particular pages as gone even 
> though they still might be visible on the web, this implementation abstracts 
> the index cleaning process, has a Solr implementation, and adds a clean index 
> plugin extension that allows others to tailor how pages might be removed from 
> their store.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


Re: Converting HTML text in org.apache.nutch.protocol.Content to String

2013-08-14 Thread feng lu
Hi byte

You can use the EncodingDetector util to detect the character encoding, and
then use TagSoup or Neko to parse the HTML. You can check the source code of
the parse-html plugin. Some code like this:

=

  // Raw page bytes as fetched; nothing has been decoded yet
  byte[] contentInOctets = content.getContent();
  InputSource input = new InputSource(new ByteArrayInputStream(contentInOctets));

  // Collect encoding clues, then pick the most likely encoding
  EncodingDetector detector = new EncodingDetector(conf);
  detector.autoDetectClues(content, true);
  detector.addClue(sniffCharacterEncoding(contentInOctets), "sniffed");
  String encoding = detector.guessEncoding(content, defaultCharEncoding);

  metadata.set(Metadata.ORIGINAL_CHAR_ENCODING, encoding);
  metadata.set(Metadata.CHAR_ENCODING_FOR_CONVERSION, encoding);

  // Parse the bytes using the guessed encoding
  input.setEncoding(encoding);
  if (LOG.isTraceEnabled()) { LOG.trace("Parsing..."); }
  root = parse(input);
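
If only a plain String is needed rather than a parsed DOM, the same idea (find
the encoding first, then decode the bytes) can be sketched with the JDK alone.
This is a simplified stand-in, not Nutch's EncodingDetector: the class
ContentToString and its methods sniffCharset and decode are hypothetical names
made up for this illustration, and the only clue it looks at is a charset
declared in a <meta> tag near the top of the page.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration; not part of Nutch.
public class ContentToString {

  // Assumption: only the start of the page is scanned for a <meta> declaration.
  private static final int SNIFF_LIMIT = 2000;

  // Matches charset=... inside a <meta> tag, e.g. <meta charset="utf-8"> or
  // <meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">.
  private static final Pattern META_CHARSET = Pattern.compile(
      "<meta[^>]*?charset\\s*=\\s*[\"']?([a-z0-9_.:-]+)",
      Pattern.CASE_INSENSITIVE);

  /** Returns the charset name declared in a meta tag, or null if none is found. */
  public static String sniffCharset(byte[] content) {
    int length = Math.min(content.length, SNIFF_LIMIT);
    // ISO-8859-1 maps every byte to exactly one char, so this cannot fail
    String head = new String(content, 0, length, StandardCharsets.ISO_8859_1);
    Matcher m = META_CHARSET.matcher(head);
    return m.find() ? m.group(1) : null;
  }

  /** Decodes the raw bytes with the sniffed charset, else with the default. */
  public static String decode(byte[] content, Charset defaultCharset) {
    Charset charset = defaultCharset;
    String name = sniffCharset(content);
    if (name != null) {
      try {
        charset = Charset.forName(name);
      } catch (IllegalArgumentException e) {
        // Unknown or malformed charset name: keep the default
      }
    }
    return new String(content, charset);
  }

  public static void main(String[] args) {
    byte[] raw = "<html><head><meta charset=\"utf-8\"></head><body>Alada\u011f</body></html>"
        .getBytes(StandardCharsets.UTF_8);
    System.out.println(decode(raw, StandardCharsets.ISO_8859_1));
  }
}
```

With a Content object, the call would be something like
ContentToString.decode(content.getContent(), StandardCharsets.UTF_8). For real
crawls the EncodingDetector approach above is likely the better choice, since
a meta tag can be missing or wrong.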


-- 
Don't Grow Old, Grow Up... :-)


[jira] [Commented] (NUTCH-1294) IndexClean job with solr implementation.

2013-08-14 Thread lufeng (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1294?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13739731#comment-13739731
 ] 

lufeng commented on NUTCH-1294:
---

Hi Lewis. My pleasure. But what should I do for README.txt? Do you mean I
should also change something in
https://svn.apache.org/repos/asf/nutch/branches/2.x/README.txt? :)
