[ 
https://issues.apache.org/jira/browse/TIKA-1302?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14006118#comment-14006118
 ] 

Tim Allison edited comment on TIKA-1302 at 5/22/14 4:50 PM:
------------------------------------------------------------

Ok, I think we might be talking about different things.  For example, when I 
pull the metadata out of 002454 with Tika 1.5, I see: 
{noformat}
[{
"dcterms:modified":["2004-05-26T15:31:39Z"],
"meta:creation-date":["2004-05-26T15:31:31Z"],
"meta:save-date":["2004-05-26T15:31:39Z"],
"dc:creator":["Slimjimbob"],
"Last-Modified":["2004-05-26T15:31:39Z"],
"Author":["Slimjimbob"],
"dcterms:created":["2004-05-26T15:31:31Z"],
"date":["2004-05-26T15:31:39Z"],
"modified":["2004-05-26T15:31:39Z"],
"creator":["Slimjimbob"],
"xmpTPg:NPages":["1"],
"Creation-Date":["2004-05-26T15:31:31Z"],
"title":["CoverMay/June04.qxd"],
"meta:author":["Slimjimbob"],
"created":["Wed May 26 11:31:31 EDT 2004"],
"producer":["Acrobat Distiller 5.00 for Macintosh"],
"Content-Type":["application/pdf"],
"xmp:CreatorTool":["QuarkXPress. 4.04: LaserWriter 8 8.7.1"],
"Last-Save-Date":["2004-05-26T15:31:39Z"],
"dc:title":["CoverMay/June04.qxd"]
}]
{noformat}

This includes more than is available here:
[ 002454  | http://digitalcorpora.org/cgi-bin/info.cgi?docid=002454 ]

Are you saying that there is no metadata truth set against which to evaluate, 
or are we using "metadata" to mean different things?  
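
If a metadata truth set did exist, the evaluation could be sketched roughly as 
below. This is only an illustration: the compare_metadata helper and the truth 
record are hypothetical (the truth values are invented, not taken from 
govdocs1), and the extracted record reuses three fields from the 002454 output 
above.

```python
import json


def compare_metadata(extracted, truth):
    """Compare one extracted metadata record against a reference record.

    Both arguments map a key to a list of string values, mirroring the
    JSON shape of the Tika output. Keys are grouped by outcome.
    """
    report = {"match": [], "mismatch": [], "missing": [], "extra": []}
    for key, expected in truth.items():
        if key not in extracted:
            # The truth set has this key, but the extractor did not emit it.
            report["missing"].append(key)
        elif sorted(extracted[key]) == sorted(expected):
            report["match"].append(key)
        else:
            report["mismatch"].append(key)
    # Keys the extractor emitted that the truth set does not cover.
    report["extra"] = sorted(k for k in extracted if k not in truth)
    return report


# Three fields copied from the 002454 output (a JSON list with one record).
extracted = json.loads(
    '[{"dc:creator":["Slimjimbob"],'
    '"xmpTPg:NPages":["1"],'
    '"Content-Type":["application/pdf"]}]'
)[0]

# Hypothetical truth record, invented for illustration only.
truth = {
    "dc:creator": ["Slimjimbob"],
    "xmpTPg:NPages": ["2"],
    "dc:language": ["en"],
}

report = compare_metadata(extracted, truth)
print(json.dumps(report, indent=2))
```

Aggregating the per-key outcomes over the whole corpus would give the kind of 
nightly metric discussed in this issue.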

Thank you again, and I look forward to seeing your paper!


was (Author: talli...@mitre.org):
Ok, I think we might be talking about different things.  For example, when I 
pull the metadata out of 002454 with Tika 1.5, I see: 
{noformat}
[{
"dcterms:modified":["2004-05-26T15:31:39Z"],
"meta:creation-date":["2004-05-26T15:31:31Z"],
"meta:save-date":["2004-05-26T15:31:39Z"],
"dc:creator":["Slimjimbob"],
"Last-Modified":["2004-05-26T15:31:39Z"],
"Author":["Slimjimbob"],
"dcterms:created":["2004-05-26T15:31:31Z"],
"date":["2004-05-26T15:31:39Z"],
"modified":["2004-05-26T15:31:39Z"],
"creator":["Slimjimbob"],
"xmpTPg:NPages":["1"],
"Creation-Date":["2004-05-26T15:31:31Z"],
"title":["CoverMay/June04.qxd"],
"meta:author":["Slimjimbob"],
"created":["Wed May 26 11:31:31 EDT 2004"],
"producer":["Acrobat Distiller 5.00 for Macintosh"],
"Content-Type":["application/pdf"],
"xmp:CreatorTool":["QuarkXPress. 4.04: LaserWriter 8 8.7.1"],
"Last-Save-Date":["2004-05-26T15:31:39Z"],
"dc:title":["CoverMay/June04.qxd"]
}]
{noformat}

This includes more than is available here:
[ 002454 meta | http://digitalcorpora.org/cgi-bin/info.cgi?docid=002454 ]

Are you saying that there is no metadata truth set against which to evaluate?  

Thank you again, and I look forward to seeing your paper!

> Let's run Tika against a large batch of docs nightly
> ----------------------------------------------------
>
>                 Key: TIKA-1302
>                 URL: https://issues.apache.org/jira/browse/TIKA-1302
>             Project: Tika
>          Issue Type: Improvement
>            Reporter: Tim Allison
>
> Many thanks to [~lewismc] for TIKA-1301!  Once we get nightly builds up and 
> running again, it might be fun to run Tika regularly against a large set of 
> docs and report metrics.
> One excellent candidate corpus is govdocs1: 
> http://digitalcorpora.org/corpora/files.
> Any other candidate corpora?  
> [~willp-bl], have anything handy you'd like to contribute? 
> [http://www.openplanetsfoundation.org/blogs/2014-03-21-tika-ride-characterising-web-content-nanite]
>  ;) 



--
This message was sent by Atlassian JIRA
(v6.2#6252)
