As for metadata, it seems to me that there are three sets:
1.) Protocol metadata
2.) Media-type metadata (MS Doc properties, HTML meta tags such as Dublin Core, and so on)
3.) Stuff we add (e.g. truncate=true).
Would it be worth representing it this way, with three separate Properties objects?
We could go further and have a hash map of Properties objects, giving us unlimited metadata sets, but I would rather wait and see if we need it.
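To make the three-set idea concrete, here is a minimal sketch. The class name ContentMetadata and its accessor names are hypothetical, invented for illustration; nothing like this exists in the patch or in Nutch as far as this thread shows.

```java
import java.util.Properties;

// Hypothetical holder for the three metadata sets listed above.
// Names are illustrative only, not an existing Nutch class.
final class ContentMetadata {
    // 1.) Protocol metadata, e.g. HTTP response headers.
    private final Properties protocol = new Properties();
    // 2.) Media-type metadata, e.g. MS Word properties or HTML meta tags.
    private final Properties mediaType = new Properties();
    // 3.) Values we add ourselves, e.g. truncate=true.
    private final Properties added = new Properties();

    Properties getProtocol() { return protocol; }
    Properties getMediaType() { return mediaType; }
    Properties getAdded() { return added; }
}
```

Keeping the sets apart means a document property called "Title" can never collide with a protocol header or with one of our own flags of the same name.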
What do you think? I would love to see this in the next release so I could baseline my code more easily.
Andy
[EMAIL PROTECTED] wrote:
Hi, Andy,
There are dates in the properties: Last Save Date, Creation Date, etc. Are they in GMT? We might want to save them in the same format as the HTTP Last-Modified header.
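For reference, a rough sketch of writing a date in the fixed HTTP date format used by Last-Modified. The class name HttpDates is made up for this example, and it assumes the java.util.Date passed in already represents the right instant; whether the Word properties are actually stored in GMT is still the open question above.

```java
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;
import java.util.Date;
import java.util.Locale;

// Hypothetical helper: formats a Date the way HTTP formats Last-Modified.
final class HttpDates {
    // Fixed-length HTTP date, always expressed in GMT with English names.
    private static final DateTimeFormatter HTTP_DATE =
        DateTimeFormatter.ofPattern("EEE, dd MMM yyyy HH:mm:ss 'GMT'", Locale.US)
                         .withZone(ZoneOffset.UTC);

    static String format(Date date) {
        return HTTP_DATE.format(date.toInstant());
    }
}
```

An explicit pattern is used so that the day of the month is always two digits, as in "Thu, 05 Aug 2004 10:44:56 GMT".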
I also think property names should probably be hyphenated (to be consistent with other MetaData) and prefixed with "x-" or similar (a simple way to give them a separate namespace, which might be useful for further processing downstream).
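A small sketch of what that renaming could look like. PropertyNames and normalize are hypothetical names, and the exact prefix is only an assumption ("x-" is one option among several); the original case of each word is kept.

```java
// Hypothetical helper: turns a Word property name such as "Last Save Date"
// into a hyphenated, prefixed name in its own namespace.
final class PropertyNames {
    // Assumed prefix; "x-" is one possibility, not a decided convention.
    static final String PREFIX = "x-";

    static String normalize(String name) {
        // Trim, then collapse each run of whitespace into a single hyphen.
        return PREFIX + name.trim().replaceAll("\\s+", "-");
    }
}
```

With this, "Last Save Date" becomes "x-Last-Save-Date" and "Creation Date" becomes "x-Creation-Date".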
Thanks,
John
On Thu, Aug 05, 2004 at 03:44:56AM -0700, SourceForge.net wrote:
Bugs item #999549, was opened at 2004-07-28 15:47
Message generated for change (Comment added) made by andyhedges
You can respond by visiting: https://sourceforge.net/tracker/?func=detail&atid=491356&aid=999549&group_id=59548
Category: plugin: other
Group: None
Status: Open
Resolution: None
Priority: 5
Submitted By: Andy Hedges (andyhedges)
Assigned to: Nobody/Anonymous (nobody)
Summary: MSWord document's title
Initial Comment:
MSWord document titles weren't being extracted and stored. This patch does that by extracting the title from the document's "properties".
----------------------------------------------------------------------
Comment By: Andy Hedges (andyhedges)
Date: 2004-08-05 10:44
Message:
Logged In: YES user_id=583029
Altered to take on board some feedback regarding patch file creation.
----------------------------------------------------------------------
Comment By: Andy Hedges (andyhedges)
Date: 2004-08-03 16:32
Message:
Logged In: YES user_id=583029
Removed some unnecessary debug.
----------------------------------------------------------------------
Comment By: Andy Hedges (andyhedges)
Date: 2004-08-03 15:41
Message:
Logged In: YES user_id=583029
Updated to neaten patch file and to include all MS Word properties.
----------------------------------------------------------------------
Comment By: Andy Hedges (andyhedges)
Date: 2004-07-29 09:06
Message:
Logged In: YES user_id=583029
After some extensive testing, I have discovered that Word 'streams' occasionally don't contain the SummaryInformation document. This apparently happens when a Word doc is opened in StarOffice (or, I imagine, OO.o) and saved out again.
Anyway, this new patch sets a timeout on the listener and, if no SummaryInformation is found, sets the title to the empty string.
This seems a bit complicated for extracting a title from a document, but that may be due to the nature of the format or the API. Could someone who is familiar with POI and the Apache API please comment?
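For anyone reading along, the fallback described above amounts to something like the following sketch. It is not the code from the patch and deliberately uses no POI classes: TitleWaiter and awaitTitle are hypothetical names, and the CompletableFuture simply stands in for whatever the listener does when it finds a SummaryInformation title.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical helper: waits a bounded time for a title to arrive and
// falls back to the empty string when none does.
final class TitleWaiter {
    static String awaitTitle(CompletableFuture<String> title, long timeoutMillis) {
        try {
            // The listener (not shown) would complete the future with the title.
            String value = title.get(timeoutMillis, TimeUnit.MILLISECONDS);
            return value != null ? value : "";
        } catch (TimeoutException | ExecutionException e) {
            // No SummaryInformation arrived in time, or reading it failed.
            return "";
        } catch (InterruptedException e) {
            // Preserve the interrupt flag for callers, then fall back.
            Thread.currentThread().interrupt();
            return "";
        }
    }
}
```

A document without SummaryInformation then costs at most the timeout and ends up with an empty title rather than blocking the parse.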
----------------------------------------------------------------------
------------------------------------------------------- This SF.Net email is sponsored by OSTG. Have you noticed the changes on Linux.com, ITManagersJournal and NewsForge in the past few weeks? Now, one more big change to announce. We are now OSTG- Open Source Technology Group. Come see the changes on the new OSTG site. www.ostg.com _______________________________________________ Nutch-developers mailing list [EMAIL PROTECTED] https://lists.sourceforge.net/lists/listinfo/nutch-developers
