[
https://issues.apache.org/jira/browse/TIKA-1302?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14006118#comment-14006118
]
Tim Allison edited comment on TIKA-1302 at 5/22/14 4:50 PM:
------------------------------------------------------------
Ok, I think we might be talking about different things. For example, when I
pull the metadata out of 002454 with Tika 1.5, I see:
{noformat}
[{
"dcterms:modified":["2004-05-26T15:31:39Z"],
"meta:creation-date":["2004-05-26T15:31:31Z"],
"meta:save-date":["2004-05-26T15:31:39Z"],
"dc:creator":["Slimjimbob"],
"Last-Modified":["2004-05-26T15:31:39Z"],
"Author":["Slimjimbob"],
"dcterms:created":["2004-05-26T15:31:31Z"],
"date":["2004-05-26T15:31:39Z"],
"modified":["2004-05-26T15:31:39Z"],
"creator":["Slimjimbob"],
"xmpTPg:NPages":["1"],
"Creation-Date":["2004-05-26T15:31:31Z"],
"title":["CoverMay/June04.qxd"],
"meta:author":["Slimjimbob"],
"created":["Wed May 26 11:31:31 EDT 2004"],
"producer":["Acrobat Distiller 5.00 for Macintosh"],
"Content-Type":["application/pdf"],
"xmp:CreatorTool":["QuarkXPress. 4.04: LaserWriter 8 8.7.1"],
"Last-Save-Date":["2004-05-26T15:31:39Z"],
"dc:title":["CoverMay/June04.qxd"]
}]
{noformat}
This includes more than is available here:
[ 002454 | http://digitalcorpora.org/cgi-bin/info.cgi?docid=002454 ]
Are you saying that there is no metadata truth set against which to evaluate, or
are we using "metadata" to mean different things?
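To make the question concrete, here is a minimal sketch of what evaluating against a metadata truth set could look like, assuming one existed. The `compare_metadata` helper and the `truth` record are hypothetical and are not taken from govdocs1 or from Tika; the `extracted` values are a few of the keys from the Tika 1.5 output above.

```python
import json


def compare_metadata(extracted, truth):
    """Compare one document's extracted metadata with a reference record.

    Both arguments map a metadata key to a list of string values,
    the same shape as the JSON output shown above.
    """
    missing = sorted(k for k in truth if k not in extracted)
    extra = sorted(k for k in extracted if k not in truth)
    mismatched = sorted(
        k for k in truth if k in extracted and extracted[k] != truth[k]
    )
    return {"missing": missing, "extra": extra, "mismatched": mismatched}


# A few values copied from the Tika 1.5 output for 002454.
extracted = {
    "dc:creator": ["Slimjimbob"],
    "dcterms:created": ["2004-05-26T15:31:31Z"],
    "xmpTPg:NPages": ["1"],
    "Content-Type": ["application/pdf"],
}

# Hypothetical truth record, invented for illustration only.
truth = {
    "Content-Type": ["application/pdf"],
    "xmpTPg:NPages": ["1"],
}

report = compare_metadata(extracted, truth)
print(json.dumps(report, indent=2))
```

Keys that land in "extra" are the ones Tika reports but the reference record does not cover, which is the situation I am asking about.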
Thank you again, and I look forward to seeing your paper!
was (Author: [email protected]):
Ok, I think we might be talking about different things. For example, when I
pull the metadata out of 002454 with Tika 1.5, I see:
{noformat}
[{
"dcterms:modified":["2004-05-26T15:31:39Z"],
"meta:creation-date":["2004-05-26T15:31:31Z"],
"meta:save-date":["2004-05-26T15:31:39Z"],
"dc:creator":["Slimjimbob"],
"Last-Modified":["2004-05-26T15:31:39Z"],
"Author":["Slimjimbob"],
"dcterms:created":["2004-05-26T15:31:31Z"],
"date":["2004-05-26T15:31:39Z"],
"modified":["2004-05-26T15:31:39Z"],
"creator":["Slimjimbob"],
"xmpTPg:NPages":["1"],
"Creation-Date":["2004-05-26T15:31:31Z"],
"title":["CoverMay/June04.qxd"],
"meta:author":["Slimjimbob"],
"created":["Wed May 26 11:31:31 EDT 2004"],
"producer":["Acrobat Distiller 5.00 for Macintosh"],
"Content-Type":["application/pdf"],
"xmp:CreatorTool":["QuarkXPress. 4.04: LaserWriter 8 8.7.1"],
"Last-Save-Date":["2004-05-26T15:31:39Z"],
"dc:title":["CoverMay/June04.qxd"]
}]
{noformat}
This includes more than is available here:
[ 002454 meta | http://digitalcorpora.org/cgi-bin/info.cgi?docid=002454 ]
Are you saying that there is no metadata truth set against which to evaluate?
Thank you again, and I look forward to seeing your paper!
> Let's run Tika against a large batch of docs nightly
> ----------------------------------------------------
>
> Key: TIKA-1302
> URL: https://issues.apache.org/jira/browse/TIKA-1302
> Project: Tika
> Issue Type: Improvement
> Reporter: Tim Allison
>
> Many thanks to [~lewismc] for TIKA-1301! Once we get nightly builds up and
> running again, it might be fun to run Tika regularly against a large set of
> docs and report metrics.
> One excellent candidate corpus is govdocs1:
> http://digitalcorpora.org/corpora/files.
> Any other candidate corpora?
> [~willp-bl], have anything handy you'd like to contribute?
> [http://www.openplanetsfoundation.org/blogs/2014-03-21-tika-ride-characterising-web-content-nanite]
> ;)