Hi, makes sense. I used the same approach recently to speed up an ingestion process with Any23 at the end of the pipeline.
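For illustration, a pre-check of this kind could look roughly like the sketch below. The class name, the method names and the regexes are all assumptions made up for this example: they are not the table published on the Web Data Commons site and not part of the Any23 API. The idea is only to run a few cheap patterns over the raw HTML and hand the page to the extractor when at least one of them matches.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Pattern;

/** Hypothetical regex pre-filter that runs before the (expensive) extraction step. */
final class MicroformatPrefilter {

    // Illustrative hints only; a real deployment would need a tuned pattern per format.
    private static final Map<String, Pattern> HINTS = new LinkedHashMap<>();

    static {
        int ci = Pattern.CASE_INSENSITIVE;
        // hCard: a class attribute containing the token "vcard"
        HINTS.put("hcard", Pattern.compile("class\\s*=\\s*[\"'][^\"']*\\bvcard\\b", ci));
        // hCalendar: a class attribute containing the token "vevent"
        HINTS.put("hcalendar", Pattern.compile("class\\s*=\\s*[\"'][^\"']*\\bvevent\\b", ci));
        // RDFa: presence of a property= or typeof= attribute
        HINTS.put("rdfa", Pattern.compile("\\s(property|typeof)\\s*=", ci));
        // Microdata: presence of itemscope or itemprop=
        HINTS.put("microdata", Pattern.compile("\\sitemscope\\b|\\sitemprop\\s*=", ci));
    }

    /** Returns the names of all formats whose hint pattern occurs in the page. */
    static List<String> candidateFormats(String html) {
        List<String> found = new ArrayList<>();
        for (Map.Entry<String, Pattern> e : HINTS.entrySet()) {
            if (e.getValue().matcher(html).find()) {
                found.add(e.getKey());
            }
        }
        return found;
    }

    /** True when at least one hint matches, i.e. the page is worth passing to the extractor. */
    static boolean worthParsing(String html) {
        return !candidateFormats(html).isEmpty();
    }
}
```

A caller would check worthParsing(html) first and only invoke the extractor on pages where it returns true. The patterns are deliberately loose: a false positive just costs one unnecessary extraction, while a false negative silently drops data.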
My 2 cents.

On Fri, Mar 23, 2012 at 12:16 PM, Szymon Danielczyk <[email protected]> wrote:
> Hi
> Paragraph from their website
>
> "Our solution is to run (Java) regular expressions against each
> webpages prior to extraction, which detect the presence of a
> microformat in a HTML page, and then only run the Any23 extractor when
> the regular expression find potentional matches."
>
> Are we using any techniques like that to decide whether there is anything
> to parse in the document?
> Maybe we can build in such a feature, like a method/filter for users that
> want to parse a huge number of docs,
> to detect whether the document is worth parsing.
>
> They have a table with the regexes they used for each format.
> Any opinions about this?
>
> Szymon
>
> On 23 March 2012 10:38, Davide Palmisano <[email protected]> wrote:
>> Thanks Michele,
>>
>> this is great news.
>>
>> Should we have a section on the web site listing
>> all the products/initiatives that are using Any23?
>>
>> On Fri, Mar 23, 2012 at 11:01 AM, Michele Mostarda
>> <[email protected]> wrote:
>>> Hi Guys,
>>>
>>> just a curiosity:
>>>
>>> Any23 has been recently used to parse the entire corpus of Semantic
>>> Web Data existing on the Web [0].
>>>
>>> The best.
>>>
>>> Mic
>>>
>>> [0] http://webdatacommons.org/
>>>
>>> --
>>> Michele Mostarda
>>> Senior Software Engineer
>>> skype: michele.mostarda
>>> twitter: micmos
>>> mail: [email protected]
>>> site : http://www.michelemostarda.com
>>
>>
>>
>> --
>> Davide Palmisano
>>
>> http://davidepalmisano.com
>> http://twitter.com/dpalmisano

--
Davide Palmisano

http://davidepalmisano.com
http://twitter.com/dpalmisano
