Hi Sebastian,

I'm taking this over to [email protected] to discuss there.
We know what we want over here at Nutch: to use the Any23 parsers to
scrape additional structured information from webpages and the like.
However, as you mention, the subsequent task of presenting it
(HtmlIndexingFilter) is not quite as straightforward. It gets trickier
once you take into account the growing range of formats that Any23 is
able to extract, which differ not only in syntax but also in semantic
representation.

I'll catch you over on the Any23 lists. Thanks

Lewis

On Tue, Apr 17, 2012 at 11:19 PM, Sebastian Nagel <
[email protected]> wrote:

> >> Well, we could easily use certain microdata key/value pairs in our
> >> results to greatly improve search and navigation.
>
> Microdata is a good show-case for the Any23 plugin.
>
> Another example would be semantic markup in shops.
> Any23 already does a good job in extracting the semantic content:
>
>  $ any23tools Rover \
>   'http://www.shopforia.com/cgi-bin/apf4/apf4.cgi?Operation=ItemLookup&ItemId=B007P4VOWC'
>
> The question is how to map triples to key-value pairs (NutchFields)
> in a straight-forward but configurable way.
> The triples
>  <#Offering_0635753498301> <#hasPriceSpecification> <#UnitPriceSpecification> .
>  <#UnitPriceSpecification> <#hasCurrencyValue> "249.99"^^<#float> ;
>        <#hasCurrency> "USD"^^<#string> ;
>        ... .
> and the pair
>  price = 249.99 USD
> carry the same information. Nutch (or Solr etc.) requires the latter form
> if you want to set up a shop search. But the conversion is not that simple
> (maybe I'm wrong?):
>  - information may be spread over several triples
>  - there may be multiple products per document
>   (same predicate for different subjects) => use sub-documents?
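
One way to picture that mapping, as a minimal sketch only: it assumes the
triples are already available as plain (subject, predicate, object) tuples
with the datatypes stripped, and every name in it (TRIPLES, FIELD_PATHS,
map_to_fields) is made up for illustration and is not part of the Any23 or
Nutch API. Each field is configured as a list of predicate paths, the values
found along those paths are joined, and each subject that carries the root
predicate becomes its own sub-document.

```python
# Hypothetical input, modelled on the triples quoted above.
# Simplification: datatypes such as ^^<#float> are dropped.
TRIPLES = [
    ("#Offering_0635753498301", "#hasPriceSpecification", "#UnitPriceSpecification"),
    ("#UnitPriceSpecification", "#hasCurrencyValue", "249.99"),
    ("#UnitPriceSpecification", "#hasCurrency", "USD"),
]

# Hypothetical configuration: field name -> predicate paths to follow,
# starting from the product node. Values are joined in path order.
FIELD_PATHS = {
    "price": [
        ["#hasPriceSpecification", "#hasCurrencyValue"],
        ["#hasPriceSpecification", "#hasCurrency"],
    ],
}


def build_index(triples):
    """Group objects by subject and predicate for quick lookups."""
    index = {}
    for subject, predicate, obj in triples:
        index.setdefault(subject, {}).setdefault(predicate, []).append(obj)
    return index


def follow(index, subject, path):
    """Walk a chain of predicates from subject and return the end values."""
    nodes = [subject]
    for predicate in path:
        nodes = [o for n in nodes for o in index.get(n, {}).get(predicate, [])]
    return nodes


def map_to_fields(triples, root_predicate, field_paths):
    """Return {subject: {field: value}}, one sub-document per product.

    A subject counts as a product if it has root_predicate; this is an
    assumption made for the sketch, not a rule taken from any vocabulary.
    """
    index = build_index(triples)
    docs = {}
    for subject, predicates in index.items():
        if root_predicate not in predicates:
            continue
        fields = {}
        for name, paths in field_paths.items():
            parts = []
            for path in paths:
                parts.extend(follow(index, subject, path))
            if parts:
                fields[name] = " ".join(parts)
        docs[subject] = fields
    return docs
```

Calling map_to_fields(TRIPLES, "#hasPriceSpecification", FIELD_PATHS) on the
triples above yields a single sub-document for the offering with price set
to "249.99 USD". Keying the result by subject is what keeps several products
on one page apart; whether Nutch should index those as sub-documents is
still the open question.
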
>
> Sebastian
>
>
> On 04/17/2012 08:05 PM, Lewis John Mcgibbney wrote:
>
>> Hi Markus,
>>
>> On Tue, Apr 17, 2012 at 12:21 PM, Markus Jelsma
>> <[email protected]> wrote:
>>
>>> You did indeed suggest that. However, if building a wrapper is fairly
>>> straightforward then it may not be a bad idea. I haven't seen any hint of
>>> Tika having Any23 on-board any time soon so we might have to wait a very
>>> long time if we want to rely on Tika.
>>>
>>>
>> Yeah +1. As I explained to Julien we are some way from thinking about
>> integration into Tika and subsequently writing the parser implementation(s)
>> for use within Tika.
>>
>>
>>
>>>
>>> Well, we could easily use certain microdata key/value pairs in our
>>> results to greatly improve search and navigation.
>>>
>>>
>> Yeah, microdata is just one of a whole bunch of formats Any23 can handle.
>> My reservations were how to represent the many different formats in a way
>> which would be easily navigable (is that a word?) within an index. There is
>> obviously work to be done here from my side.
>>
>> Thanks
>>
>> Lewis
>>
>>
>


-- 
*Lewis*
