>> Well, we could easily use certain microdata key/value pairs in our results
>> to greatly improve search and navigation.
Microdata is a good showcase for the Any23 plugin.
Another example would be semantic markup in shops.
Any23 already does a good job of extracting the semantic content:
$ any23tools Rover \
'http://www.shopforia.com/cgi-bin/apf4/apf4.cgi?Operation=ItemLookup&ItemId=B007P4VOWC'
The question is how to map triples to key-value pairs (NutchFields)
in a straightforward but configurable way.
The triples
  <#Offering_0635753498301> <#hasPriceSpecification> <#UnitPriceSpecification> .
  <#UnitPriceSpecification> <#hasCurrencyValue> "249.99"^^<#float> ;
                            <#hasCurrency> "USD"^^<#string> ;
                            ... .
and the pair
  price = 249.99 USD
carry the same information. Nutch (or Solr etc.) requires the latter form
if you want to set up a shop search. But the conversion is not that simple
(maybe I'm wrong?):
- information may be spread over several triples
- there may be multiple products per document
  (same predicate for different subjects) => use sub-documents?
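To make the two points above concrete, here is a rough sketch in Python (not Nutch code, and not an existing Any23 or Nutch API). It assumes the triples are already available as plain (subject, predicate, object) tuples with datatypes dropped; FIELD_RULES, follow() and to_fields() are hypothetical names made up for illustration. Each rule follows one or more predicate paths from a subject and joins the values into one field, and every subject that yields fields becomes its own sub-document:

```python
from collections import defaultdict

# Triples hand-copied from the example above (datatypes dropped).
TRIPLES = [
    ("#Offering_0635753498301", "#hasPriceSpecification", "#UnitPriceSpecification"),
    ("#UnitPriceSpecification", "#hasCurrencyValue", "249.99"),
    ("#UnitPriceSpecification", "#hasCurrency", "USD"),
]

# Hypothetical mapping config: field name -> predicate paths + output template.
FIELD_RULES = {
    "price": {
        "paths": [
            ["#hasPriceSpecification", "#hasCurrencyValue"],
            ["#hasPriceSpecification", "#hasCurrency"],
        ],
        "format": "{0} {1}",
    },
}


def index_triples(triples):
    """Index triples as subject -> predicate -> list of objects."""
    graph = defaultdict(lambda: defaultdict(list))
    for subject, predicate, obj in triples:
        graph[subject][predicate].append(obj)
    return graph


def follow(graph, start, path):
    """Return all objects reachable from `start` along the predicate path."""
    nodes = [start]
    for predicate in path:
        nodes = [o for n in nodes for o in graph.get(n, {}).get(predicate, [])]
    return nodes


def to_fields(triples, rules):
    """Build one key-value dict (sub-document) per subject that matches a rule."""
    graph = index_triples(triples)
    docs = {}
    for subject in list(graph):
        fields = {}
        for name, rule in rules.items():
            values = [follow(graph, subject, path) for path in rule["paths"]]
            if all(values):  # every path must yield at least one value
                # Only the first value per path is used in this sketch.
                fields[name] = rule["format"].format(*(v[0] for v in values))
        if fields:
            docs[subject] = fields
    return docs


if __name__ == "__main__":
    print(to_fields(TRIPLES, FIELD_RULES))
```

Keying the result by subject keeps several products on one page apart. Whether those should end up as separate Nutch documents or as multi-valued fields is still the open question.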
Sebastian
On 04/17/2012 08:05 PM, Lewis John Mcgibbney wrote:
Hi Markus,
On Tue, Apr 17, 2012 at 12:21 PM, Markus Jelsma <[email protected]> wrote:
> You did indeed suggest that. However, if building a wrapper is fairly
> straightforward then it may not be a bad idea. I haven't seen any hint of
> Tika having Any23 on-board any time soon so we might have to wait a very
> long time if we want to rely on Tika.
Yeah +1. As I explained to Julien we are some way from thinking about
integration into Tika and subsequently writing the parser implementation(s)
for use within Tika.
> Well, we could easily use certain microdata key/value pairs in our results
> to greatly improve search and navigation.
Yeah, microdata is just one of a whole bunch of formats Any23 can handle.
My reservation was how to represent the many different formats in a way
that would be easily navigable (is that a word?) within an index. There is
obviously work to be done here from my side.
Thanks
Lewis