On 23 March 2012 12:16, Szymon Danielczyk <[email protected]> wrote:
> Hi
> Paragraph from their website
>
> "Our solution is to run (Java) regular expressions against each
> webpages prior to extraction, which detect the presence of a
> microformat in a HTML page, and then only run the Any23 extractor when
> the regular expression find potentional matches."
>
> Are we using any techniques like that to decide whether there is anything
> to parse in the document?
> Maybe we can build in such a feature, like a method/filter for users that
> want to parse a huge number of docs,
> to detect that the document is worth parsing.
>
> They have a table with the regexes they used for each format.
> Any opinions about this?

ATM we just elect a list of candidate Extractors on the basis of the page
content MIME type; however, we could introduce an optimization for massive
processing based on pre-detection of content within a page.

Mic

>
> Szymon
>
> On 23 March 2012 10:38, Davide Palmisano <[email protected]> wrote:
> > Thanks Michele,
> >
> > this is great news.
> >
> > Should we have a section on the web site listing
> > all the products/initiatives that are using Any23?
> >
> > On Fri, Mar 23, 2012 at 11:01 AM, Michele Mostarda
> > <[email protected]> wrote:
> >> Hi Guys,
> >>
> >> just a curiosity:
> >>
> >> Any23 has been recently used to parse the entire corpus of Semantic
> >> Web Data existing on the Web [0].
> >>
> >> The best.
> >>
> >> Mic
> >>
> >> [0] http://webdatacommons.org/
> >>
> >> --
> >> Michele Mostarda
> >> Senior Software Engineer
> >> skype: michele.mostarda
> >> twitter: micmos
> >> mail: [email protected]
> >> site : http://www.michelemostarda.com
> >
> >
> >
> > --
> > Davide Palmisano
> >
> > http://davidepalmisano.com
> > http://twitter.com/dpalmisano
>

--
Michele Mostarda
Senior Software Engineer
skype: michele.mostarda
twitter: micmos
mail: [email protected]
site : http://www.michelemostarda.com
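
To make the pre-detection idea concrete, here is a rough sketch of what such a filter might look like. Everything in it is an assumption for illustration: the class name MicroformatPrefilter, its methods, and the three patterns are hypothetical, are not part of the Any23 API, and are not the regexes from the Web Data Commons table.

```java
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;
import java.util.regex.Pattern;

/** Hypothetical pre-filter sketch; this class does not exist in Any23. */
final class MicroformatPrefilter {

    // Illustrative patterns only (assumptions), keyed by a made-up format name.
    private static final Map<String, Pattern> PATTERNS = new LinkedHashMap<>();
    static {
        PATTERNS.put("hcard", classToken("vcard"));
        PATTERNS.put("hcalendar", classToken("vevent"));
        // Very loose RDFa hint: a 'property' or 'typeof' attribute somewhere in the markup.
        PATTERNS.put("rdfa",
            Pattern.compile("\\s(?:property|typeof)\\s*=", Pattern.CASE_INSENSITIVE));
    }

    // Matches a quoted class attribute whose value contains the given token as a whole word.
    private static Pattern classToken(String token) {
        return Pattern.compile(
            "class\\s*=\\s*[\"'][^\"']*\\b" + Pattern.quote(token) + "\\b[^\"']*[\"']",
            Pattern.CASE_INSENSITIVE);
    }

    /** Returns the names of all formats whose pattern finds a potential match. */
    static Set<String> detect(String html) {
        Set<String> hits = new LinkedHashSet<>();
        for (Map.Entry<String, Pattern> entry : PATTERNS.entrySet()) {
            if (entry.getValue().matcher(html).find()) {
                hits.add(entry.getKey());
            }
        }
        return hits;
    }

    /** True if at least one pattern matched, i.e. running the extractors may pay off. */
    static boolean worthParsing(String html) {
        return !detect(html).isEmpty();
    }
}
```

A caller doing massive processing could check MicroformatPrefilter.worthParsing(html) first and only hand the page to the extractors elected by MIME type when it returns true. The patterns are deliberately loose: a false positive only costs one unnecessary extraction, while a false negative silently drops data.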
