Hello Ferran!

On 2013-10-14 at 16:12, Ferran Jorba wrote:
> First of all, please, please, make sure that those records have a
> unique identifier. You'll find yourself loading and reloading them,
> either on your test or your production server. This unique identifier
> will allow Invenio to replace the older version with the new one reusing
> the same Invenio identifier. You can find a similar thread here:

My idea was to export the same format with the Invenio ID added, to use for corrections; see my workflow description in the attached documents (if the list lets them through).

I wrote a Python script to walk through a directory of media and extract as much metadata as possible. It writes the initial metadata file, which then gets enhanced/corrected manually by the team. Afterwards we'll inject it into Invenio.

At the moment our different sources are just hinted at in the file path; the team has yet to define which part I should interpret as Source/Publisher and what default values other fields may get.

Here are some typical raw samples, full of errors:

DATE 2013-08-05
FILE invenio-data/BIOM/igraem_na_prirode_i_reshaem_zadachi_po_ekologii_rasteniy-ecodelo.pdf
LEN 20 p.
SIZE 210.0×297.0mm
TITLE Igraem Na Prirode I Reshaem Zadachi Po Ekologii Rasteniy-Ecodelo
TITLE ИГРАЕМ НА ПРИРОДЕ И РЕШАЕМ ЗАДАЧИ ПО ЭКОЛОГИИ РАСТЕНИЙ

AUT Администратор
DATE 2009-06-22
FILE invenio-data/literatura.kg/children_KG/doschul_bala_zhany.pdf
LEN 178 p.
SIZE 148.2×209.9mm
TITLE Doschul Bala Zhany
TITLE Досчул бала.

AUT <F2E0E1FBEBE4FB>
DATE 2009-02-02
FILE invenio-data/literatura.kg/children_KG/togo_ber_zhamgyr.pdf
LEN 46 p.
SIZE 210.0×297.0mm
TITLE Togo Ber Zhamgyr
TITLE <443A5C323030395CC6E0EAE5E3E55CC0E4E0F8EAE0ED5CC0E4E0F8EAE0ED20>

AUT Victor
AUT Aline
DATE 2013-09-05
FILE invenio-data/literatura.kg/children_RU/er_tjoshtuk_skazka.doc
TITLE Er Tjoshtuk Skazka
TITLE Эр Тюштюк

AUT Fyodor Dostoevsky
CAT Speech
DATE 2007
FILE invenio-data/gutenberg.org/Russian/21183-01.mp3
LEN 00:05:52.716
SER White Nights
TITLE 21183-01
TITLE 7 - Morning

AUT Rachinskii, Sergei Aleksandrovich
DATE 2005-08-14
DATE 2013-03-13T12:07:30.288832+00:00
FILE invenio-data/gutenberg.org/Russian/16527.epub
KEY Word problems (Mathematics)
KEY Mathematics -- Problems, exercises, etc.
LANG rus
LIC Public domain in the USA.
REFNO http://www.gutenberg.org/ebooks/16527
TITLE 16527
TITLE 1001 задача для умственного счета

AUT Чехов, Антон Павлович
FILE invenio-data/pocketbook-int.com/Russian/Chehov, Anton - Djadja Vanja.epub
KEY Dramaturgy
KEY Russian Literature
KEY Classic Literature
LANG rus
REFNO urn:uuid:64425a11-346d-491b-9dfb-63ef1253a31c
TITLE Chehov, Anton - Djadja Vanja
TITLE Дядя Ваня

> The second lesson is that, if you end up using «simple» office
> applications, I think that repeatable values are better solved with a
> known character, like the semicolon you seem to use, rather than
> multiple columns, because you can repeat the value zero or more times
> without wasting fields.

We settled on several lines with the same key code. I haven’t tried yet how that works with BibConvert.

> And finally, I don't know how flexible is BibConvert, as I don't use it,
> but if you feel comfortable with Python, probably in the long run it
> will pay to invest on it, as probably you'll have to fiddle with some
> subtle cases where the flexibility of a real programming language will
> help you.

I hope I'll get away with coding all the exceptions into my dirty little script and using a clean BibConvert setup after that.
>>> For a unified search you'd like to have I think at least
>>> Invenio 1.2 (current head master, if Tibor managed to merge
>>> authority based searches yet).
>>
>> Grmbl, I forked at v1.1.2.473-1ab71 and decoupled. But I must learn
>> how to manage upstream changes in git anyway. At the moment, I set my
>> personal server repository as origin, develop on my Mac and pull
>> changes from the development web server. All my changes
>> (e.g. configuration, web style, docs) are in a personal branch, so it
>> should cause no problems to pull master.
>
> You may take a look at guilt. I did a brief introduction during last
> year's Invenio Users Meeting that I hope to expand this November. The
> slides are here
> http://ddd.uab.cat/record/93913

Thanks for the hint! But I'll try to learn git better first.

Greetlings,
Hraban
---
http://www.fiee.net
https://www.cacert.org (I'm an assurer)
INVENIO WebSubmit Data Elements for UCA eBilim
by Henning Hraban Ramm, version 2013-10-14.
for UCA eBilim project.
(c) UCA 2013. License: GNU Free Documentation License.
Preface
Elements define the field types for WebSubmit Document Types.
- (key code in metadata file) = (data element name)
- (field type of data element)
- (Marc code)
Please find descriptions and examples in metadata_format_uca.
Title
Subtitle
Series
Author(s)
- AUT = UCA_AUTHORS
- List (one author per line)
- Format: Last name, first names father’s name [Alias]
- MARC: 100__a, 700__a
Description/Abstract
Remarks
Internal comments, e.g. about source, license
Date
- DATE = UCA_DATE
- Text (one line), max. length 10
- preferably ISO date format = yyyy-mm-dd, use yyyy or yyyy-mm or some fuzzy description like “18th century” as appropriate
- MARC: 260__c
Location
Language
Reference number
External reference number, e.g. order number, shelf number
ISBN
Only for books.
ISSN
Only for magazines.
Source
Publisher
License
Keywords
Length
Number of pages or run time in minutes
Size
Physical dimensions of the medium, like page format or pixel size.
Timestamp
Automatically filled on update
Metadata exchange format for UCA eBilim
by Henning Hraban Ramm, version 2013-10-14.
for UCA eBilim project.
(c) UCA 2013. License: GNU Free Documentation License.
Preface
This document defines a format for metadata exchange between the UCA eBilim team and the contractor for conversion and upload to Invenio, as agreed upon on 2013-10-11.
We use a simple plain text format to collect metadata for our media files. This gets converted to MarcXML, Invenio’s native import format.
Sample
FILE Literature/Educational/16527.epub
CAT Educational;Math
TITLE 1001 задача для умственного счета
AUT Rachinskij, Sergej Aleksandrovich
DESC A collection of mathematical puzzles
LANG rus
DATE 1899
PUBL Project Gutenberg
SRC gutenberg.org
LIC PGL
REFNO 16527
KEY sadanye
KEY puzzle
# author’s name and description should be in Russian
ID 123454
FILE Misc/Data/next.pdf
CAT Literature
CAT Children;Something;Subcategory
TITLE Some other book
SUBT Tales from the playground
DESC This is just a silly test entry that should show some features of the input format.
AUT Ramm, Henning Hraban
AUT Rosset, Aline
LANG eng
LANG deu
LEN 256 с.
DATE 2013-10
LOC Bishkek
PUBL édition fiëé
SRC Authors
LIC PD
KEY obsolete
KEY just a test
Definitions
- The input file is a plain text file, encoded in UTF-8. We can also accept documents in MS Word or OpenDocument (OpenOffice/LibreOffice) formats.
- There will be different files per media type (Invenio: collection), e.g. Video, Book, Audio.
- A block of metadata (a record) describing one media file consists of definition lines without empty lines. The number of lines per record can vary.
- Each line defines one metadata item. It starts with an uppercase key code and one or more tabulator characters. Don’t use space characters after the key code!
- Empty lines delimit records.
- The order of lines in a record doesn’t matter. Exception: the first AUT line is the primary author, further AUT lines become secondary authors.
- Lines can get as long as needed. Soft line breaking by your word processor or text editor is no problem, but avoid hard line breaks!
- Comment lines start with a # (hash, fence, number sign); they’re ignored by the processing program. Use them for temporary comments, they don’t end up in the database. You can also use comment lines for visual structuring of input files or for markers like “### Here I stopped working!”.
- Generally, use Cyrillic script for Russian, Kyrgyz etc., no transliteration!
- “Marc” denotes the corresponding Marc field number; it doesn’t matter for members of the team.
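The rules above can be sketched as a small parser. This is only an illustration under assumptions, not the contractor's actual script: the function name parse_records is made up here, whitespace-only lines are treated as empty lines, and a line whose key code is not followed by a tab is rejected outright.

```python
import re

# Uppercase key code, one or more tab characters, then the value.
LINE_RE = re.compile(r"^([A-Z]+)\t+(.*)$")


def parse_records(text):
    """Split metadata text into records, each a list of (key, value) pairs.

    Hypothetical helper: the name and the error handling are assumptions.
    """
    records, current = [], []
    for raw in text.splitlines():
        if not raw.strip():
            # An empty line delimits records.
            if current:
                records.append(current)
                current = []
            continue
        if raw.startswith("#"):
            # Comment lines never reach the database.
            continue
        match = LINE_RE.match(raw)
        if match is None:
            # E.g. a space instead of a tab after the key code.
            raise ValueError("malformed line: %r" % raw)
        current.append((match.group(1), match.group(2).strip()))
    if current:
        records.append(current)
    return records


sample = (
    "FILE\tMisc/Data/next.pdf\n"
    "# a temporary comment\n"
    "AUT\tRamm, Henning Hraban\n"
    "AUT\tRosset, Aline\n"
    "\n"
    "TITLE\t\tBaba Yaga\n"
)
records = parse_records(sample)
```

Keeping each record as an ordered list of pairs, rather than a dictionary, preserves repeated keys and the order of AUT lines, which matters for the primary author.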
Workflow
Initial Upload
- UCA team (Aline) sorts collected media data on an external harddrive.
- Contractor (Hraban) prepares raw metadata files from this file/directory structure, just containing the FILE lines and possibly mechanically extractable data (like titles from PDFs), and sends them to UCA team.
- UCA team fills metadata files and sends them back to contractor.
- Contractor checks these files and suggests/requests corrections.
- UCA team makes corrections.
- Contractor converts metadata to MarcXML and uploads into Invenio server.
Corrections / Maintenance
- Contractor exports metadata from Invenio database, converts it to the format defined here and sends it to UCA team. (These files will contain ID lines.)
- UCA team makes corrections, leaving ID lines unchanged, and sends the files back to contractor.
- Contractor checks, converts and uploads new metadata; unchanged data will get ignored, changed data will overwrite old values.
Metadata keys
ID
This is an Invenio-internal unique media identification number, never change it! It will appear only on exported data (see section “Workflow”). Don’t try to provide it yourself!
MARC: 001 // DE: ?
FILE
Relative file path for import/upload. The root directory of this is to be defined elsewhere.
The file name may contain spaces, but please avoid spaces in directory names! File or directory names must not start with spaces! Unicode file or directory names (e.g. Russian/Kyrgyz) are valid.
Path separator is always a forward slash (/, Unix style), not a backslash (\, Windows style).
FILE may appear several times if the media consists of several files (e.g. audio books).
FILE may be left out, if there is no file to upload (e.g. entry for paper books or DVDs).
MARC: FFT__a // DE: ?
Example:
FILE Children/Audio/Fairytales/1200001.mp3
FILE Children/Audio/Fairytales/1200002.mp3
FILE Children/Audio/Fairytales/1200003.mp3
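A check of the FILE rules could look like the following sketch. The function name file_path_problems and the wording of the messages are made up for illustration; the checks cover only the rules stated above (forward slashes, no spaces in directory names, no names starting with a space).

```python
def file_path_problems(path):
    """Return a list of rule violations for one FILE value (empty if none).

    Hypothetical helper, not part of Invenio or BibConvert.
    """
    problems = []
    if "\\" in path:
        problems.append("backslash used as path separator")
    parts = path.split("/")
    directories = parts[:-1]  # everything before the file name
    if any(" " in name for name in directories):
        problems.append("space in directory name")
    if any(name.startswith(" ") for name in parts):
        problems.append("name starts with a space")
    return problems


ok = file_path_problems("Children/Audio/Fairytales/1200001.mp3")
spaced_file = file_path_problems("Russian/Chehov, Anton - Djadja Vanja.epub")
windows_style = file_path_problems("Children\\Audio\\1200001.mp3")
spaced_dir = file_path_problems("My Files/next.pdf")
```

Returning a list of messages instead of raising on the first problem lets the contractor report all issues of a file in one round of corrections.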
CAT
Category, as defined by taxonomy. Hierarchical categories are separated by a semicolon (;).
CAT may appear several times to put a media into several categories.
MARC: ? // DE: ?
Example:
CAT Children;Entertainment
CAT Entertainment;Folklore;Fairytales
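To illustrate how several CAT lines describe positions in a hierarchy, here is a minimal sketch that splits each value at the semicolons and merges the results into a nested dictionary. The function names are hypothetical, and how the categories are finally stored in Invenio is not defined here.

```python
def split_category(value):
    """Return the hierarchy levels of one CAT value, outermost first."""
    return [level.strip() for level in value.split(";") if level.strip()]


def category_tree(values):
    """Merge several CAT values into a nested dict keyed by level name."""
    tree = {}
    for value in values:
        node = tree
        for level in split_category(value):
            # Descend one level, creating the branch if it is new.
            node = node.setdefault(level, {})
    return tree


tree = category_tree([
    "Children;Entertainment",
    "Entertainment;Folklore;Fairytales",
])
```

Note that “Entertainment” below “Children” and “Entertainment” at the top level stay separate branches in this sketch, because each CAT value is read as a full path from the root.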
TITLE
Main title or name of the media.
TITLE must appear at least once.
It may appear several times if the media has several titles of the same level, e.g. in different languages.
If several titles are hierarchical, use SUBT.
With audio books that consist of several chapters, use this for the main title (not the chapter title).
(We didn’t use the short form TIT to avoid offending anyone ;-)
Example:
TITLE Baba Yaga
SUBT
Subtitle of the media.
SUBT may appear several times, e.g. for subtitles in different languages.
Several subtitles of the same level and in the same language should use only one SUBT line.
With audio books that consist of several chapters, use this for the chapter title.
MARC: 245__b // DE: UCA_SUBTITLE
Example:
SUBT The hut on chicken feet
AUT
Author. Use the form “Last name, first name(s) [Alias]”, as applicable.
AUT may appear several times for several authors. The first AUT line is stored as primary author, further AUT lines as secondary authors.
MARC: 100__a, 700__a // DE: UCA_AUTHOR
Example:
AUT Andersen, Christian
AUT Uljanow, Wladimir Iljitsch [Lenin]
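The mapping of AUT lines to MARC fields might be sketched like this, using Python's standard xml.etree.ElementTree. The datafield/subfield layout follows the usual MarcXML convention with blank indicators; the function name is hypothetical, the XML namespace is left out for brevity, and the real conversion may well be done with BibConvert instead.

```python
import xml.etree.ElementTree as ET


def author_datafields(authors):
    """Build MarcXML datafield elements: first author 100__a, others 700__a.

    Hypothetical helper for illustration only.
    """
    fields = []
    for index, name in enumerate(authors):
        tag = "100" if index == 0 else "700"
        field = ET.Element("datafield", {"tag": tag, "ind1": " ", "ind2": " "})
        subfield = ET.SubElement(field, "subfield", {"code": "a"})
        subfield.text = name
        fields.append(field)
    return fields


fields = author_datafields([
    "Andersen, Christian",
    "Uljanow, Wladimir Iljitsch [Lenin]",
])
for field in fields:
    print(ET.tostring(field, encoding="unicode"))
```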
SER
Series title; also usable for e.g. album name of songs.
MARC: 490__a // DE: UCA_SERIES
Examples:
SER Live in Bishkek
SER A Song of Ice and Fire
DESC
Description or abstract.
DESC should appear only once. Several DESC lines may get combined or ignored…
MARC: 520__a // DE: UCA_DESCRIPTION
Example:
DESC A long and winding novel about a man and a woman, that meet somewhere in Central Asia.
LANG
Main language of the media.
Use only three-letter codes from ISO 639-3, see e.g. SIL or Ethnologue:
- rus = Russian
- kir = Kyrgyz
- uzn = Uzbek (actually North Uzbek, as spoken in Kyrgyzstan)
- tgk = Tadjik
- dng = Dungan
- eng = English (no further specification)
- deu = German
- fra = French
LANG may appear several times if the media is multilingual.
MARC: 041__a // DE: UCA_LANGUAGE
Example:
LANG rus
KEY
Keyword or key phrase.
One word/phrase per line. No translations.
MARC: 653__a // DE: UCA_KEYWORDS
Example:
KEY unemployment
KEY on the dole
REFNO
Foreign reference number, i.e. number of this media in original project or DOI or order number etc.
Example:
REFNO 10.1000/182
ISBN or ISSN
International Standard Book/Serial Number
MARC: 020__a / 022__a // DE: UCA_ISBN, UCA_ISSN
Example:
ISSN 0361-526X
ISBN 978-3-86680-192-9
SRC
Source, i.e. where or from whom you got the file, e.g. name of organization or URL of website
MARC: 541__a // DE: UCA_SOURCE
Examples:
SRC gutenberg.org
SRC Author’s widow
PUBL
Publisher / publishing organization
MARC: 260__b // DE: UCA_PUBLISHER
Example:
PUBL University of Central Asia
LIC
Name or abbreviation of License
Note that you must regard any media as copyrighted if you don’t know better!
You can either write the full name of the license or better use an abbreviation from this list:
- C = © Copyright
- CP = © Copyright, but with permission (i.e. the copyright holder allowed to use the media for eBilim)
- PD = Public Domain (70 years after death of author or if declared as PD)
- CC0 = Creative Commons Public Domain
- CC-BY = Creative Commons Attribution
- CC-BY-SA = Creative Commons Attribution-Sharealike
- CC-BY-ND = Creative Commons Attribution-NoDerivs
- CC-BY-NC = Creative Commons Attribution-NonCommercial
- CC-BY-NC-SA = Creative Commons Attribution-NonCommercial-ShareAlike
- CC-BY-NC-ND = Creative Commons Attribution-NonCommercial-NoDerivs
- PGL = Project Gutenberg License
- GPL = GNU General Public License
- FDL = GNU Free Documentation License
- OPL = Open Publication License
- OCL = Open Content License
- FAL = Free Art License
In case of CP, use the REM field to note who gave the permission or any special conditions!
If you get several files with a license not listed above, please let me know to add it!
It is possible that a medium is dual-licensed, in that case simply use two LIC lines.
MARC: 542__l // DE: UCA_LICENSE
Example:
LIC PD
REM
Remarks - internal use, but saved to database, thus permanent.
Several REM lines get concatenated.
MARC: 500__a // DE: UCA_REMARKS
Example:
REM Wlad Putin himself allowed us the use of his memoires for free.
DATE
Date of recording or publishing, if applicable.
Date format after ISO 8601, i.e. YYYY-MM-DD.
If you know year and month, use YYYY-MM, otherwise just the year.
If you know only a fuzzy date, like “end of 18th century”, write that.
Examples:
DATE 2013-12-31
DATE Yin dynasty
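A DATE value can be sorted into “ISO-like” or “fuzzy” with a regular expression, as in this sketch. The function name classify_date is made up, and the pattern checks only the shape yyyy, yyyy-mm or yyyy-mm-dd; it does not verify that the day exists in the given month.

```python
import re

# yyyy, optionally followed by -mm, optionally followed by -dd.
ISO_DATE_RE = re.compile(
    r"^\d{4}(-(0[1-9]|1[0-2])(-(0[1-9]|[12]\d|3[01]))?)?$"
)


def classify_date(value):
    """Return 'iso' for yyyy, yyyy-mm or yyyy-mm-dd, otherwise 'fuzzy'.

    Hypothetical helper: shape check only, no calendar validation.
    """
    return "iso" if ISO_DATE_RE.match(value.strip()) else "fuzzy"


kinds = [classify_date(v) for v in ("2013-12-31", "2013-10", "1899", "Yin dynasty")]
```

Values classified as fuzzy are not errors under the rules above; a checker would merely list them so that someone can confirm they are intended.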
LOC
Location of publishing or recording.
Don’t use this for a location as subject of a medium (like a book about Naryn).
If a book is published in several locations at once, use several LOC lines.
MARC: 260__a // DE: UCA_LOCATION
Example:
LOC Moscow
LEN
Length as number of pages or runtime.
MARC: 300__a, maybe 306__a // DE: UCA_LENGTH
Examples:
LEN 16 с.
LEN 00:21:02
SIZE
Physical dimensions of the medium, like page format or pixel size.
Examples:
SIZE 1024x768px
SIZE 210x297mm
TST
Only in exported data: Timestamp of last change. Gets automatically updated, manual changes are ignored.
MARC: 005 // DE: UCA_TIMESTAMP
Example:
TST 20131130221300.0
