[ https://issues.apache.org/jira/browse/NUTCH-2079?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14693456#comment-14693456 ]
Pradumna Panditrao commented on NUTCH-2079:
-------------------------------------------
Hi,
1. Currently, getParse parses the URL and the whole page. I want to pass on only
particular data, e.g. name, age, and location, when the page contains them.
Please guide me on how to do this.
2. Once I know exactly which parsed content meets my requirement, I will make
the corresponding changes in the index plugin.
3. Yes, I have added the same entries to gora-mongodb-mapping.xml.
Please let me know.
My sample code:
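import java.io.File;
import java.io.FileInputStream;
import java.io.InputStream;
import java.util.ArrayList;
import java.util.List;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.Parser;
import org.apache.tika.sax.BodyContentHandler;
import org.apache.tika.sax.PhoneExtractingContentHandler;

// Collects the phone numbers Tika finds in the given file.
List<String> extractPhoneNumbers(File file) throws Exception {
    List<String> phoneNumbers = new ArrayList<>();
    Parser parser = new AutoDetectParser();
    Metadata metadata = new Metadata();
    // Phone number extractor: scans the body text and records matches in metadata
    PhoneExtractingContentHandler handler =
        new PhoneExtractingContentHandler(new BodyContentHandler(), metadata);
    // try-with-resources closes the stream even if parsing fails
    try (InputStream stream = new FileInputStream(file)) {
        parser.parse(stream, handler, metadata, new ParseContext());
    }
    // Matches are stored under the "phonenumbers" metadata key
    for (String number : metadata.getValues("phonenumbers")) {
        phoneNumbers.add(number);
    }
    return phoneNumbers;
}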
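For point 1, the kind of field extraction meant here could look roughly like the sketch below. It is only an illustration under stated assumptions: it assumes the parsed page text contains "Label: value" lines for name, age, and location, and the class FieldExtractor and its extract method are hypothetical names, not part of Nutch or Tika. It uses only the Java standard library and would still need to be wired into the parse and index plugins.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: pulls selected "Label: value" fields out of parsed text.
class FieldExtractor {

    // Assumption: each wanted field sits on its own line as "Name: ...", "Age: ...", "Location: ...".
    // Flags: case-insensitive labels, and ^/$ match at line boundaries.
    private static final Pattern FIELD = Pattern.compile(
        "^\\s*(name|age|location)\\s*:\\s*(.+?)\\s*$",
        Pattern.CASE_INSENSITIVE | Pattern.MULTILINE);

    // Returns the fields found, keyed by lower-cased label; the first occurrence of a label wins.
    static Map<String, String> extract(String text) {
        Map<String, String> fields = new LinkedHashMap<>();
        Matcher m = FIELD.matcher(text);
        while (m.find()) {
            fields.putIfAbsent(m.group(1).toLowerCase(Locale.ROOT), m.group(2));
        }
        return fields;
    }

    public static void main(String[] args) {
        // Made-up sample text standing in for the text of a parsed page.
        String text = "Profile\nName: Asha Rao\nAge: 31\nLocation: Pune\n";
        System.out.println(extract(text));
    }
}
```

The returned map could then be copied into whatever metadata the index plugin reads. A regex is used here only to keep the sketch short; pages without such labels would need a different extraction rule.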
> Tika Parsing plugin issue
> -------------------------
>
> Key: NUTCH-2079
> URL: https://issues.apache.org/jira/browse/NUTCH-2079
> Project: Nutch
> Issue Type: New Feature
> Components: deployment
> Affects Versions: 2.3
> Environment: Ubuntu 14.04
> Reporter: Pradumna Panditrao
> Fix For: 2.3
>
>
> Hi,
> I am trying to parse particular data and store it in MongoDB. However, when
> I try to modify the parse-tika plugin, it is so tightly coupled to other
> classes that data goes missing. I want to pick up particular data from a
> website using the same plugin and put it into MongoDB.
> Please suggest how to do this.