[ https://issues.apache.org/jira/browse/TIKA-1302?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14006118#comment-14006118 ]
Tim Allison edited comment on TIKA-1302 at 5/22/14 4:50 PM:
------------------------------------------------------------

Ok, I think we might be talking about different things. For example, when I pull the metadata out of 002454 with Tika 1.5, I see:

{noformat}
[{
"dcterms:modified":["2004-05-26T15:31:39Z"],
"meta:creation-date":["2004-05-26T15:31:31Z"],
"meta:save-date":["2004-05-26T15:31:39Z"],
"dc:creator":["Slimjimbob"],
"Last-Modified":["2004-05-26T15:31:39Z"],
"Author":["Slimjimbob"],
"dcterms:created":["2004-05-26T15:31:31Z"],
"date":["2004-05-26T15:31:39Z"],
"modified":["2004-05-26T15:31:39Z"],
"creator":["Slimjimbob"],
"xmpTPg:NPages":["1"],
"Creation-Date":["2004-05-26T15:31:31Z"],
"title":["CoverMay/June04.qxd"],
"meta:author":["Slimjimbob"],
"created":["Wed May 26 11:31:31 EDT 2004"],
"producer":["Acrobat Distiller 5.00 for Macintosh"],
"Content-Type":["application/pdf"],
"xmp:CreatorTool":["QuarkXPress. 4.04: LaserWriter 8 8.7.1"],
"Last-Save-Date":["2004-05-26T15:31:39Z"],
"dc:title":["CoverMay/June04.qxd"]
}]
{noformat}

This includes more than is available here: [ 002454 | http://digitalcorpora.org/cgi-bin/info.cgi?docid=002454 ]

Are you saying that there is no metadata truth set against which to evaluate or are we using "metadata" to mean different things?

Thank you again, and I look forward to seeing your paper!

was (Author: talli...@mitre.org):

Ok, I think we might be talking about different things.
For example, when I pull the metadata out of 002454 with Tika 1.5, I see:

{noformat}
[{
"dcterms:modified":["2004-05-26T15:31:39Z"],
"meta:creation-date":["2004-05-26T15:31:31Z"],
"meta:save-date":["2004-05-26T15:31:39Z"],
"dc:creator":["Slimjimbob"],
"Last-Modified":["2004-05-26T15:31:39Z"],
"Author":["Slimjimbob"],
"dcterms:created":["2004-05-26T15:31:31Z"],
"date":["2004-05-26T15:31:39Z"],
"modified":["2004-05-26T15:31:39Z"],
"creator":["Slimjimbob"],
"xmpTPg:NPages":["1"],
"Creation-Date":["2004-05-26T15:31:31Z"],
"title":["CoverMay/June04.qxd"],
"meta:author":["Slimjimbob"],
"created":["Wed May 26 11:31:31 EDT 2004"],
"producer":["Acrobat Distiller 5.00 for Macintosh"],
"Content-Type":["application/pdf"],
"xmp:CreatorTool":["QuarkXPress. 4.04: LaserWriter 8 8.7.1"],
"Last-Save-Date":["2004-05-26T15:31:39Z"],
"dc:title":["CoverMay/June04.qxd"]
}]
{noformat}

This includes more than is available here: [ 002454 meta | http://digitalcorpora.org/cgi-bin/info.cgi?docid=002454 ]

Are you saying that there is no metadata truth set against which to evaluate?

Thank you again, and I look forward to seeing your paper!

> Let's run Tika against a large batch of docs nightly
> ----------------------------------------------------
>
> Key: TIKA-1302
> URL: https://issues.apache.org/jira/browse/TIKA-1302
> Project: Tika
> Issue Type: Improvement
> Reporter: Tim Allison
>
> Many thanks to [~lewismc] for TIKA-1301! Once we get nightly builds up and
> running again, it might be fun to run Tika regularly against a large set of
> docs and report metrics.
> One excellent candidate corpus is govdocs1:
> http://digitalcorpora.org/corpora/files.
> Any other candidate corpora?
> [~willp-bl], have anything handy you'd like to contribute?
> [http://www.openplanetsfoundation.org/blogs/2014-03-21-tika-ride-characterising-web-content-nanite]
> ;)

--
This message was sent by Atlassian JIRA
(v6.2#6252)