Hi Guys,

A rather interesting discussion has emerged over on dev@nutch regarding building the Any23 Nutch plugin [0]. Please see Sebastian Nagel's comments below for the most recent contribution, which got me thinking more about it today. I'd advise reading over the short conversation before reading on, as it's better in context :0)
The overwhelming majority of Nutch users build searchable Solr indexes from the content they retrieve via Nutch, so we're looking to build a plugin solution which does two jobs:

1) A Tika-wrapped Any23 parser plugin, enabling us to use the core Any23 parsers for extraction.
2) An HtmlIndexingFilter, enabling us to process the triples and get them into a Solr index in a way that is easily searchable via fields.

As we discussed, and as Sebastian graphically highlights below, this is not clear cut, so I wanted to hear anyone's thoughts/input on building 2) before I begin.

Thanks in advance
Lewis

[0] http://www.mail-archive.com/dev%40nutch.apache.org/msg07104.html

---------- Forwarded message ----------
From: Sebastian Nagel <[email protected]>
Date: Tue, Apr 17, 2012 at 11:19 PM
Subject: Re: NUTCH-1129
To: [email protected]

>> Well, we could easily use certain microdata key/value pairs in our results
>> to greatly improve search and navigation.

Microdata is a good show-case for the Any23 plugin. Another example would be semantic markup in shops. Any23 already does a good job in extracting the semantic content:

$ any23tools Rover \
  'http://www.shopforia.com/cgi-bin/apf4/apf4.cgi?Operation=ItemLookup&ItemId=B007P4VOWC'

The question is how to map triples to key-value pairs (NutchFields) in a straight-forward but configurable way. The triples

<#Offering_0635753498301> <#hasPriceSpecification> <#UnitPriceSpecification> .
<#UnitPriceSpecification> <#hasCurrencyValue> "249.99"^^<#float> ;
                          <#hasCurrency> "USD"^^<#string> ;
                          ... .

and the pair

price = 249.99 USD

carry the same information. Nutch (or Solr etc.) requires the latter form if you want to set up a shop search.
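A minimal sketch of what such a configurable mapping might look like, written in Python for brevity. Every name in it (TRIPLES, FIELD_CONFIG, triples_to_fields) is hypothetical and belongs to neither Nutch nor Any23; triples are reduced to plain tuples and datatypes are dropped.

```python
# Hypothetical sketch: none of these names are Nutch or Any23 APIs.
from collections import defaultdict

# The example triples as (subject, predicate, object) tuples, datatypes dropped.
TRIPLES = [
    ("#Offering_0635753498301", "#hasPriceSpecification", "#UnitPriceSpecification"),
    ("#UnitPriceSpecification", "#hasCurrencyValue", "249.99"),
    ("#UnitPriceSpecification", "#hasCurrency", "USD"),
]

# Assumed configuration format:
#   field name -> (predicates to follow from the root subject,
#                  predicates whose objects are joined into the field value)
FIELD_CONFIG = {
    "price": (["#hasPriceSpecification"], ["#hasCurrencyValue", "#hasCurrency"]),
}


def build_index(triples):
    """Group objects by (subject, predicate) for direct lookup."""
    index = defaultdict(list)
    for subject, predicate, obj in triples:
        index[(subject, predicate)].append(obj)
    return index


def triples_to_fields(triples, root, config):
    """Map the triples reachable from `root` to field name -> list of values."""
    index = build_index(triples)
    fields = {}
    for field, (path, value_predicates) in config.items():
        # Walk the predicate path from the root to the nodes holding the values.
        nodes = [root]
        for predicate in path:
            nodes = [o for n in nodes for o in index.get((n, predicate), [])]
        # Join the first object of each value predicate into one field value.
        for node in nodes:
            parts = [
                index[(node, p)][0]
                for p in value_predicates
                if index.get((node, p))
            ]
            if parts:
                fields.setdefault(field, []).append(" ".join(parts))
    return fields


if __name__ == "__main__":
    print(triples_to_fields(TRIPLES, "#Offering_0635753498301", FIELD_CONFIG))
```

Each field holds a list of values, since a predicate path may match more than once for a given subject.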
But conversion is not as simple (maybe I'm wrong?):
- information may be spread over several triples
- there may be multiple products per document (same predicate for different subjects) => use sub-documents?

Sebastian

On 04/17/2012 08:05 PM, Lewis John Mcgibbney wrote:
> Hi Markus,
>
> On Tue, Apr 17, 2012 at 12:21 PM, Markus Jelsma <[email protected]> wrote:
>
>> You did indeed suggest that. However, if building a wrapper is fairly
>> straightforward then it may not be a bad idea. I haven't seen any hint of
>> Tika having Any23 on-board any time soon so we might have to wait a very
>> long time if we want to rely on Tika.
>
> Yeah +1. As I explained to Julien we are some way from thinking about
> integration into Tika and subsequently writing the parser implementation(s)
> for use within Tika.
>
>> Well, we could easily use certain microdata key/value pairs in our results
>> to greatly improve search and navigation.
>
> Yeah, microdata is just one from a whole bunch of formats Any23 can handle.
> My reservations were how to represent the many different formats in a way
> which would be easily navigable (is that a word?) within an index. There is
> obviously work to be done here from my side.
>
> Thanks
>
> Lewis
>
> --
> *Lewis*
