Re: http://webdatacommons.org/

Michele Mostarda Fri, 23 Mar 2012 09:42:21 -0700

On 23 March 2012 12:16, Szymon Danielczyk <[email protected]> wrote:


> Hi,
> A paragraph from their website:
>
> "Our solution is to run (Java) regular expressions against each
> webpages prior to extraction, which detect the presence of a
> microformat in a HTML page, and then only run the Any23 extractor when
> the regular expression find potentional matches."
>
> Are we using any techniques like that to decide whether there is anything
> to parse in the document?
> Maybe we could build in such a feature, like a method/filter for users who
> want to parse a huge number of docs,
> to detect whether a document is worth parsing.
>
> They have a table with the regexes they used for each format.
> Any opinions about this?
>

ATM we just select a list of candidate Extractors on the basis of the
page content MIME type; however, we could introduce an optimization
for massive processing based on pre-detection of content within a page.
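
A minimal sketch of what such a pre-detection step could look like, in
plain Java with no Any23 dependency. The class name MicroformatPreFilter
and the method worthParsing are hypothetical and not part of the Any23
API, and the patterns are illustrative guesses, not the ones from the
webdatacommons.org table. A caller would only hand the page to the
extractors when worthParsing returns true.

```java
import java.util.regex.Pattern;

/** Hypothetical pre-filter; not part of the Any23 API. */
public class MicroformatPreFilter {

    // Illustrative patterns only (assumption): a few microformat class
    // names, the Microdata attributes and two RDFa attributes.
    private static final Pattern[] HINTS = {
        Pattern.compile("class\\s*=\\s*[\"'][^\"']*\\b(vcard|vevent|hreview)\\b",
                        Pattern.CASE_INSENSITIVE),
        Pattern.compile("\\b(itemscope|itemtype)\\b", Pattern.CASE_INSENSITIVE),
        Pattern.compile("\\b(typeof|property)\\s*=", Pattern.CASE_INSENSITIVE)
    };

    /** Returns true if any hint pattern occurs somewhere in the page source. */
    public static boolean worthParsing(String html) {
        for (Pattern p : HINTS) {
            if (p.matcher(html).find()) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        String withVcard = "<div class=\"vcard\"><span class=\"fn\">Mic</span></div>";
        String plain = "<p>plain page</p>";
        System.out.println(worthParsing(withVcard));
        System.out.println(worthParsing(plain));
    }
}
```

The check is deliberately loose: a false positive only costs one
unnecessary extraction, while a false negative would silently drop data,
so the patterns should err on the side of matching.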

Mic


>
> Szymon
>
> On 23 March 2012 10:38, Davide Palmisano <[email protected]> wrote:
> > Thanks Michele,
> >
> > this is great news.
> >
> > Should we have a section on the web site listing
> > all the products/initiatives that are using Any23?
> >
> > On Fri, Mar 23, 2012 at 11:01 AM, Michele Mostarda
> > <[email protected]> wrote:
> >> Hi Guys,
> >>
> >>   just a curiosity:
> >>
> >>    Any23 has recently been used to parse the entire corpus of Semantic
> >> Web Data existing on the Web [0].
> >>
> >> The best.
> >>
> >> Mic
> >>
> >> [0] http://webdatacommons.org/
> >>
> >> --
> >> Michele Mostarda
> >> Senior Software Engineer
> >> skype: michele.mostarda
> >> twitter: micmos
> >> mail: [email protected]
> >> site : http://www.michelemostarda.com
> >
> >
> >
> > --
> > Davide Palmisano
> >
> > http://davidepalmisano.com
> > http://twitter.com/dpalmisano
>



-- 
Michele Mostarda
Senior Software Engineer
skype: michele.mostarda
twitter: micmos
mail: [email protected]
site : http://www.michelemostarda.com
