Hello Ferran!

On 2013-10-14 at 16:12, Ferran Jorba <[email protected]> wrote:

> First of all, please, please, make sure that those records have an
> unique identifier.  You'll find yourself loading and reloading them,
> either on your test or your production server.  This unique identifier
> will allow Invenio to replace the older version with the new one reusing
> the same Invenio identifier.  You can find a similar thread here:

My idea was to export the same format with the Invenio ID added and to use 
that for corrections; see the workflow description in the attached documents 
(if the list lets them through).

I wrote a Python script that walks through a directory of media and extracts 
as much metadata as possible. It writes the initial metadata file, which the 
team then enhances and corrects manually. Afterwards we'll inject it into 
Invenio.
At the moment our different sources are only hinted at in the file path; the 
team has yet to define which part I should interpret as Source/Publisher and 
what default values other fields should get.
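
A minimal sketch of such a first pass, for illustration only (this is not the actual script). The function names are made up, and deriving a TITLE by replacing underscores in the file name and title-casing the result is only a guess based on the samples that follow:

```python
import os

def title_from_filename(path):
    """Guess a title from the file name: drop the extension, turn '_' into spaces."""
    stem = os.path.splitext(os.path.basename(path))[0]
    return stem.replace("_", " ").title()

def raw_records(root):
    """Yield one record, a list of (key, value) pairs, per file below root."""
    for dirpath, dirnames, filenames in os.walk(root):
        dirnames.sort()                      # make the walk order reproducible
        for name in sorted(filenames):
            full = os.path.join(dirpath, name)
            # Paths are written relative to root, always with forward slashes.
            rel = os.path.relpath(full, root).replace(os.sep, "/")
            yield [("FILE", rel), ("TITLE", title_from_filename(name))]

def format_records(records):
    """Render records as KEY<tab>value lines, separated by one empty line."""
    blocks = ["\n".join(f"{key}\t{value}" for key, value in rec) for rec in records]
    return "\n\n".join(blocks) + "\n"
```

Calling format_records(raw_records("invenio-data")) would give the skeleton that the team then fills in; anything extracted from inside the files (PDF info, ID3 tags and so on) would be added on top of this.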

Here are some typical raw samples, full of errors:

DATE    2013-08-05
FILE    invenio-data/BIOM/igraem_na_prirode_i_reshaem_zadachi_po_ekologii_rasteniy-ecodelo.pdf
LEN     20 p.
SIZE    210.0×297.0mm
TITLE   Igraem Na Prirode I Reshaem Zadachi Po Ekologii Rasteniy-Ecodelo
TITLE   ИГРАЕМ НА ПРИРОДЕ И РЕШАЕМ ЗАДАЧИ  ПО ЭКОЛОГИИ РАСТЕНИЙ

AUT     Администратор
DATE    2009-06-22
FILE    invenio-data/literatura.kg/children_KG/doschul_bala_zhany.pdf
LEN     178 p.
SIZE    148.2×209.9mm
TITLE   Doschul Bala Zhany
TITLE   Досчул бала.

AUT     <F2E0E1FBEBE4FB>
DATE    2009-02-02
FILE    invenio-data/literatura.kg/children_KG/togo_ber_zhamgyr.pdf
LEN     46 p.
SIZE    210.0×297.0mm
TITLE   Togo Ber Zhamgyr
TITLE   <443A5C323030395CC6E0EAE5E3E55CC0E4E0F8EAE0ED5CC0E4E0F8EAE0ED20>

AUT     Victor
AUT     Aline
DATE    2013-09-05
FILE    invenio-data/literatura.kg/children_RU/er_tjoshtuk_skazka.doc
TITLE   Er Tjoshtuk Skazka
TITLE   Эр Тюштюк

AUT     Fyodor Dostoevsky
CAT     Speech
DATE    2007
FILE    invenio-data/gutenberg.org/Russian/21183-01.mp3
LEN     00:05:52.716
SER     White Nights
TITLE   21183-01
TITLE   7 - Morning

AUT     Rachinskii, Sergei Aleksandrovich
DATE    2005-08-14
DATE    2013-03-13T12:07:30.288832+00:00
FILE    invenio-data/gutenberg.org/Russian/16527.epub
KEY     Word problems (Mathematics)
KEY     Mathematics -- Problems, exercises, etc.
LANG    rus
LIC     Public domain in the USA.
REFNO   http://www.gutenberg.org/ebooks/16527
TITLE   16527
TITLE   1001 задача для умственного счета

AUT     Чехов, Антон Павлович
FILE    invenio-data/pocketbook-int.com/Russian/Chehov, Anton - Djadja Vanja.epub
KEY     Dramaturgy
KEY     Russian Literature
KEY     Classic Literature
LANG    rus
REFNO   urn:uuid:64425a11-346d-491b-9dfb-63ef1253a31c
TITLE   Chehov, Anton - Djadja Vanja
TITLE   Дядя Ваня


> The second lesson is that, if you end up using «simple» office
> applications, I think that repeteable values are better solved with a
> known character, like the semicolon you seem to use, rather than
> multiple columns, because you can repeat the value zero or more times
> without wasting fields.

We settled on several lines with the same key code.
I haven't tried yet how that works with BibConvert.

> And finally, I don't know how flexible is BibConvert, as I don't use it,
> but if you feel confortable with Python, probably in the long run it
> will pay to invest on it, as probably you'll have to fiddle with some
> subtle cases where the flexibilty of a real programming language will
> help you.

I hope I'll get away with coding all the exceptions into my dirty little 
script and using a clean BibConvert setup after that.

>>> For a unified search you'd like to have I think at least
>>> Invenio 1.2 (current head master, if Tibor managed to merge
>>> authority based searches yet).
>> 
>> Grmbl, I forked at v1.1.2.473-1ab71 and decoupled. But I must learn
>> how to manage upstream changes in git anyway. At the moment, I set my
>> personal server repository as origin, develop on my Mac and pull
>> changes from the development web server. All my changes
>> (e.g. configuration, web style, docs) are in a personal branch, so it
>> should cause no problems to pull master.
> 
> You may take a look at guilt.  I did a brief introduction during last
> year Invenio Users Meeting that I hope to expand this november.  The
> slides are here
> http://ddd.uab.cat/record/93913


Thanks for the hint! But I'll try to learn git better first.



Greetlings, Hraban
Grüßlinge, Hraban
---
http://www.fiee.net
https://www.cacert.org (I'm an assurer)




INVENIO WebSubmit Data Elements for UCA eBilim

by Henning Hraban Ramm, version 2013–10–14.

for UCA eBilim project.

(c) UCA 2013. License: GNU Free Documentation License.

Preface

Elements define the field types for WebSubmit Document Types.

Please find descriptions and examples in metadata_format_uca.

Title

Subtitle

  • SUBT = UCA_SUBTITLE
  • Text (one line)
  • MARC: 245__b

Series

  • SER = UCA_SERIES
  • Text (one line)
  • MARC: 490__a

Author(s)

  • AUT = UCA_AUTHORS
  • List (one author per line)
  • Format: Last name, first names father’s name [Alias]
  • MARC: 100__a, 700__a

Description/Abstract

  • DESC = UCA_DESCRIPTION
  • Text (multiline)
  • MARC: 520__a

Remarks

Internal comments, e.g. about source, license

  • REM = UCA_REMARKS
  • Text (multiline)
  • MARC: 500__a

Date

  • DATE = UCA_DATE
  • Text (one line), max. length 10
  • preferably ISO date format = yyyy-mm-dd, use yyyy or yyyy-mm or some fuzzy description like “18th century” as appropriate
  • MARC: 260__c

Location

  • LOC = UCA_LOCATION
  • Text (one line)
  • MARC: 260__a

Language

  • LANG = UCA_LANGUAGE
  • Selector (ATM only one value)
  • MARC: 041__a

Reference number

External reference number, e.g. order number, shelf number

ISBN

Only for books.

  • ISBN = UCA_ISBN
  • Text (one line), max. length 16
  • MARC: 020__a

ISSN

Only for magazines.

  • ISSN = UCA_ISSN
  • Text (one line), max. length 16
  • MARC: 022__a

Source

  • SRC = UCA_SOURCE
  • Text (one line)
  • MARC: 541__a

Publisher

  • PUBL = UCA_PUBLISHER
  • Text (one line)
  • MARC: 260__b

License

Keywords

  • KEY = UCA_KEYWORDS
  • List (one keyword or phrase per line)
  • MARC: 653__a

Length

Number of pages or run time in minutes

  • LEN = UCA_LENGTH
  • Text (one line), max. length 4
  • Validator: number (not yet)
  • MARC: 300__a

Size

Physical dimensions of the medium, like page format or pixel size.

Timestamp

Automatically filled on update

  • TST = UCA_TIMESTAMP
  • Text (one line), max. length 16
  • Format: yyyymmddhhmmss.0
  • MARC: 005

Metadata exchange format for UCA eBilim

by Henning Hraban Ramm, version 2013–10–14.

for UCA eBilim project.

(c) UCA 2013. License: GNU Free Documentation License.

Preface

This document defines a format for metadata exchange between the UCA eBilim team and the contractor for conversion and upload to Invenio, as agreed upon on 2013–10–11.

We use a simple plain text format to collect metadata for our media files. This gets converted to MarcXML, Invenio’s native import format.

Sample

FILE    Literature/Educational/16527.epub
CAT     Educational;Math
TITLE   1001 задача для умственного счета  
AUT     Rachinskij, Sergej Aleksandrovich
DESC    A collection of mathematical puzzles    
LANG    rus 
DATE    1899    
PUBL    Project Gutenberg
SRC     gutenberg.org
LIC     PGL
REFNO   16527
KEY     sadanye
KEY     puzzle
# author’s name and description should be in Russian

ID      123454
FILE    Misc/Data/next.pdf
CAT     Literature
CAT     Children;Something;Subcategory
TITLE   Some other book
SUBT    Tales from the playground
DESC    This is just a silly test entry that should show some features of the input format.
AUT     Ramm, Henning Hraban
AUT     Rosset, Aline
LANG    eng
LANG    deu
LEN     256 с.
DATE    2013-10
LOC     Bishkek
PUBL    édition fiëé
SRC     Authors
LIC     PD
KEY     obsolete
KEY     just a test

Definitions

  • The input file is a plain text file, encoded in UTF–8. We can also accept documents in MS Word or OpenDocument (OpenOffice/LibreOffice) formats.
  • There will be different files per media type (Invenio: collection), e.g. Video, Book, Audio.
  • A block of metadata (a record) describing one media file consists of definition lines without empty lines. The number of lines per record can vary.
  • Each line defines one metadata item. It starts with an uppercase key code and one or more tabulator characters. Don’t use space characters after the key code!
  • Empty lines delimit records.
  • The order of lines in a record doesn’t matter. Exception: the first AUT line is the primary author, further AUT lines become secondary authors.
  • Lines can get as long as needed. Soft line breaking of your word processor or text editor is no problem, but avoid hard line breaks!
  • Comment lines start with a # (hash, fence, number sign); the processing program ignores them. Use them for temporary comments; they don’t end up in the database. You can also use comment lines to structure input files visually or as markers like “### Here I stopped working!”.
  • Generally, use Cyrillic script for Russian, Kyrgyz etc.; no transliteration!
  • “MARC” denotes the corresponding MARC field and “DE” the WebSubmit data element; neither matters for members of the team.
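
The rules above can be read as a small parser. The following is a minimal sketch, not the project's actual processing program; the function names parse_records and values are hypothetical, and rejecting lines without a tabulator is one possible way to enforce the "no spaces after the key code" rule:

```python
def parse_records(text):
    """Split the input text into records; each record is a list of (key, value) pairs."""
    records, current = [], []
    for line in text.splitlines():
        if not line.strip():              # an empty line closes the current record
            if current:
                records.append(current)
                current = []
            continue
        if line.startswith("#"):          # comment lines are ignored
            continue
        key, sep, value = line.partition("\t")
        if not sep or not (key.isalpha() and key.isupper()):
            raise ValueError(f"expected KEY<tab>value, got: {line!r}")
        current.append((key, value.strip()))   # strip() also drops extra tabs
    if current:
        records.append(current)
    return records

def values(record, key):
    """All values of a possibly repeated key, in input order."""
    return [v for k, v in record if k == key]
```

Keeping each record as an ordered list of pairs, rather than a dictionary, preserves repeated keys and the order of AUT lines, which decides who is the primary author.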

Workflow

Initial Upload

  • UCA team (Aline) sorts collected media data on an external hard drive.
  • Contractor (Hraban) prepares raw metadata files from this file/directory structure, just containing the FILE lines and possibly mechanically extractable data (like titles from PDFs), and sends them to UCA team.
  • UCA team fills metadata files and sends them back to contractor.
  • Contractor checks these files and suggests/requests corrections.
  • UCA team makes corrections.
  • Contractor converts metadata to MarcXML and uploads into Invenio server.
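
As an illustration of the conversion step, here is a hedged sketch that turns one record into a MarcXML record element with Python's standard library. It covers only a subset of the keys, using the MARC fields listed in this document; the names record_to_marcxml, MARC_MAP and MERGE_TAGS are hypothetical, and collecting the 245 and 260 subfields into one datafield each is an assumption about the desired output. FILE (FFT) handling is left out:

```python
import xml.etree.ElementTree as ET

# Subset of the key-to-MARC mapping from this document: key -> (tag, subfield code).
MARC_MAP = {
    "TITLE": ("245", "a"), "SUBT": ("245", "b"),
    "LOC": ("260", "a"), "PUBL": ("260", "b"), "DATE": ("260", "c"),
    "LANG": ("041", "a"), "DESC": ("520", "a"), "KEY": ("653", "a"),
}
# Assumption: subfields of these tags are collected into a single datafield.
MERGE_TAGS = {"245", "260"}

def record_to_marcxml(record):
    """Convert a list of (key, value) pairs into a MarcXML <record> element."""
    rec = ET.Element("record")
    last_field = {}        # tag -> most recently created datafield
    seen_author = False

    def add(tag, code, value):
        field = last_field.get(tag)
        reuse = (field is not None and tag in MERGE_TAGS
                 and field.find(f"subfield[@code='{code}']") is None)
        if not reuse:      # start a new datafield for repeated values
            field = ET.SubElement(rec, "datafield", {"tag": tag, "ind1": " ", "ind2": " "})
            last_field[tag] = field
        ET.SubElement(field, "subfield", {"code": code}).text = value

    for key, value in record:
        if key == "ID":
            ET.SubElement(rec, "controlfield", {"tag": "001"}).text = value
        elif key == "AUT":  # first author goes to 100, further authors to 700
            add("700" if seen_author else "100", "a", value)
            seen_author = True
        elif key in MARC_MAP:
            add(*MARC_MAP[key], value)
    return rec
```

ET.tostring(record_to_marcxml(rec), encoding="unicode") gives the XML text. For an upload, the records would typically be wrapped in a collection element; keys not in the mapping are silently skipped in this sketch.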

Corrections / Maintenance

  • Contractor exports metadata from the Invenio database, converts it to the format defined here and sends it to the UCA team. (These files will contain ID lines.)
  • UCA team makes corrections, leaving ID lines unchanged, and sends back to contractor.
  • Contractor checks, converts and uploads new metadata; unchanged data will get ignored, changed data will overwrite old values.
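
One way to find out which corrected records actually changed is to compare them with the export, keyed by ID. This is a sketch under assumptions: the helper names are made up, both inputs are taken to be dictionaries mapping an ID to a list of (key, value) pairs, and TST lines are left out of the comparison because manual changes to them are ignored anyway:

```python
def normalize(record):
    """Make records comparable: line order is irrelevant except for AUT lines."""
    authors = [value for key, value in record if key == "AUT"]
    others = sorted((key, value) for key, value in record
                    if key not in ("AUT", "ID", "TST"))
    return authors, others

def changed_records(exported, corrected):
    """Return {ID: corrected record} for every record that differs from the export."""
    changed = {}
    for rec_id, record in corrected.items():
        if rec_id not in exported:
            # An unknown ID means somebody edited or invented an ID line.
            raise KeyError(f"ID {rec_id} was not part of the export")
        if normalize(record) != normalize(exported[rec_id]):
            changed[rec_id] = record
    return changed
```

Only the records returned here would need to be converted and uploaded again.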

Metadata keys

ID

This is an Invenio-internal unique media identification number, never change it! It will appear only on exported data (see section “workflow”). Don’t try to provide it yourself!

MARC: 001 // DE: ?

FILE

Relative file path for import/upload. The root directory of this is to be defined elsewhere.

The file name may contain spaces, but please avoid spaces in directory names! File or directory names must not start with spaces! Unicode file or directory names (e.g. Russian/Kyrgyz) are valid.

Path separator is always a forward slash (/, Unix style), not a backslash (\, Windows style).

FILE may appear several times if the media consists of several files (e.g. audio books). FILE may be left out, if there is no file to upload (e.g. entry for paper books or DVDs).

MARC: FFT__a // DE: ?

Example:

FILE    Children/Audio/Fairytales/1200001.mp3
FILE    Children/Audio/Fairytales/1200002.mp3
FILE    Children/Audio/Fairytales/1200003.mp3
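
The path rules above are easy to check mechanically. A small sketch, with a hypothetical function name and made-up message texts:

```python
def file_path_problems(path):
    """Return a list of rule violations for a FILE value (empty list = fine)."""
    problems = []
    if "\\" in path:
        problems.append("backslash used as path separator")
    if path.startswith("/"):
        problems.append("path is not relative")
    *dirs, name = path.split("/")
    for directory in dirs:
        if " " in directory:              # also covers a leading space
            problems.append(f"space in directory name {directory!r}")
    if name.startswith(" "):              # spaces inside file names are allowed
        problems.append("file name starts with a space")
    return problems
```

For example, file_path_problems("Children/Audio/Fairytales/1200001.mp3") returns an empty list, while a Windows-style path is reported with one message.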

CAT

Category, as defined by taxonomy. Hierarchical categories are separated by a semicolon (;).

CAT may appear several times to put a medium into several categories.

MARC: ? // DE: ?

Example:

CAT     Children;Entertainment
CAT     Entertainment;Folklore;Fairytales

TITLE

Main title or name of the media.

TITLE must appear at least once. It may appear several times if the media has several titles of the same level, e.g. in different languages. If several titles are hierarchical, use SUBT.

With audio books that consist of several chapters, use this for the main title (not the chapter title).

(We didn’t use the short form TIT to avoid offending anyone ;-)

MARC: 245__a // DE: UCA_TITLE

Example:

TITLE   Baba Yaga

SUBT

Subtitle of the media.

SUBT may appear several times, e.g. for subtitles in different languages. Several subtitles of the same level and in the same language should use only one SUBT line.

With audio books that consist of several chapters, use this for the chapter title.

MARC: 245__b // DE: UCA_SUBTITLE

Example:

SUBT    The hut on chicken feet

AUT

Author. Use the form “Last name, first name(s) [Alias]”, as applicable.

AUT may appear several times for several authors. The first AUT line is stored as primary author, further AUT lines as secondary authors.

MARC: 100__a, 700__a // DE: UCA_AUTHOR

Example:

AUT     Andersen, Christian
AUT     Uljanow, Wladimir Iljitsch [Lenin]
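
To illustrate the author form, here is a sketch that splits an AUT value into last name, first name(s) and alias. The function name is hypothetical, and treating a value without a comma as a bare last name is an assumption:

```python
def parse_author(value):
    """Split 'Last name, first name(s) [Alias]' into (last, first, alias)."""
    value = value.strip()
    alias = None
    if value.endswith("]") and "[" in value:
        # Cut off the trailing "[Alias]" part.
        value, _, alias = value[:-1].rpartition("[")
        alias = alias.strip()
    last, comma, first = value.partition(",")
    first = first.strip() if comma else ""
    return last.strip(), first or None, alias
```

With the examples above, parse_author("Uljanow, Wladimir Iljitsch [Lenin]") gives ("Uljanow", "Wladimir Iljitsch", "Lenin").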

SER

Series title; also usable for e.g. album name of songs.

MARC: 490__a // DE: UCA_SERIES

Examples:

SER     Live in Bishkek
SER     A Song of Ice and Fire

DESC

Description or abstract.

DESC should appear only once. Several DESC lines may get combined or ignored…

MARC: 520__a // DE: UCA_DESCRIPTION

Example:

DESC    A long and winding novel about a man and a woman who meet somewhere in Central Asia.

LANG

Main language of the media.

Use only three-letter codes from ISO 639–3, see e.g. SIL or Ethnologue

  • rus = Russian
  • kir = Kyrgyz
  • uzn = Uzbek (actually North Uzbek, as spoken in Kyrgyzstan)
  • tgk = Tajik
  • dng = Dungan
  • eng = English (no further specification)
  • deu = German
  • fra = French

LANG may appear several times if the media is multilingual.

MARC: 041__a // DE: UCA_LANGUAGE

Example:

LANG    rus

KEY

Keyword or key phrase.

One word/phrase per line. No translations.

MARC: 653__a // DE: UCA_KEYWORDS

Example:

KEY     unemployment
KEY     on the dole

REFNO

External reference number, e.g. the number of this medium in its original project, a DOI or an order number.

MARC: 088__a // DE: UCA_REFNO

Example:

REFNO   10.1000/182

ISBN or ISSN

International Standard Book/Serial Number

MARC: 020__a / 022__a // DE: UCA_ISBN, UCA_ISSN

Example:

ISSN    0361-526X
ISBN    978-3-86680-192-9
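
Both numbers carry a check digit, so typing errors can be caught before upload. The following sketch uses the standard check-digit rules for ISSN and ISBN-13; the function names are hypothetical, and older 10-digit ISBNs are not handled here:

```python
def issn_is_valid(issn):
    """ISSN: 7 digits weighted 8..2, check character 0-9 or X (mod 11)."""
    chars = issn.replace("-", "").upper()
    if len(chars) != 8 or not chars[:7].isdigit() or chars[7] not in "0123456789X":
        return False
    total = sum(int(c) * w for c, w in zip(chars[:7], range(8, 1, -1)))
    check = (11 - total % 11) % 11
    return chars[7] == ("X" if check == 10 else str(check))

def isbn13_is_valid(isbn):
    """ISBN-13: digits weighted 1, 3, 1, 3, ...; the total must be divisible by 10."""
    digits = isbn.replace("-", "").replace(" ", "")
    if len(digits) != 13 or not digits.isdigit():
        return False
    total = sum(int(c) * (1 if i % 2 == 0 else 3) for i, c in enumerate(digits))
    return total % 10 == 0
```

Both example numbers above pass these checks.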

SRC

Source, i.e. where or from whom you got the file, e.g. name of organization or URL of website

MARC: 541__a // DE: UCA_SOURCE

Examples:

SRC     gutenberg.org

SRC     Author’s widow

PUBL

Publisher / publishing organization

MARC: 260__b // DE: UCA_PUBLISHER

Example:

PUBL    University of Central Asia

LIC

Name or abbreviation of License

Note that you must regard any media as copyrighted unless you know otherwise!

You can either write the full name of the license or better use an abbreviation from this list:

In case of CP, use the REM field to note who gave the permission or any special conditions!

If you get several files with a license not listed above, please let me know to add it!

It is possible that a medium is dual-licensed, in that case simply use two LIC lines.

MARC: 542__l // DE: UCA_LICENSE

Example:

LIC     PD

REM

Remarks - internal use, but saved to database, thus permanent.

Several REM lines get concatenated.

MARC: 500__a // DE: UCA_REMARKS

Example:

REM     Wlad Putin himself allowed us the use of his memoirs for free.

DATE

Date of recording or publishing, if applicable.

The date format follows ISO 8601, i.e. YYYY-MM-DD. If you know only year and month, use YYYY-MM; if you know only the year, use YYYY. If you know only a fuzzy date, like “end of 18th century”, write that.

MARC: 260__c // DE: UCA_DATE

Examples:

DATE    2013-12-31

DATE    Yin dynasty
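
A checker could tell the three ISO forms apart from free-text dates like this. It is a sketch with a hypothetical function name; the category labels are made up for illustration:

```python
import re
from datetime import date

ISO_DATE = re.compile(r"^(\d{4})(?:-(\d{2})(?:-(\d{2}))?)?$")

def classify_date(value):
    """Return 'year', 'month' or 'day' for (partial) ISO dates, else 'fuzzy'."""
    match = ISO_DATE.match(value.strip())
    if not match:
        return "fuzzy"                 # e.g. "end of 18th century"
    year, month, day = match.groups()
    # Raises ValueError for impossible dates; missing parts default to 1.
    date(int(year), int(month or 1), int(day or 1))
    return "day" if day else "month" if month else "year"
```

Anything that is not one of the three ISO forms, including a full timestamp, is reported as 'fuzzy' here, so such values can be reviewed by hand.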

LOC

Location of publishing or recording.

Don’t use this for a location as subject of a medium (like a book about Naryn). If a book is published in several locations at once, use several LOC lines.

MARC: 260__a // DE: UCA_LOCATION

Example:

LOC     Moscow

LEN

Length as number of pages or runtime.

MARC: 300__a, maybe 306__a // DE: UCA_LENGTH

Examples:

LEN     16 с.

LEN     00:21:02
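
Since LEN mixes page counts and run times, a validator has to recognise both. A sketch, assuming pages are written as a number with an optional Latin "p." or Cyrillic "с." and run times as hh:mm:ss with optional fractions of a second; the function name and the result labels are hypothetical:

```python
import re

# "\u0441" is the Cyrillic letter es, as used in "16 с.".
PAGES = re.compile(r"^(\d+)\s*(?:p\.|\u0441\.)?$")
RUNTIME = re.compile(r"^(\d+):(\d{2}):(\d{2})(?:\.\d+)?$")

def parse_length(value):
    """Return ('pages', n) or ('seconds', n) for a LEN value."""
    value = value.strip()
    match = PAGES.match(value)
    if match:
        return ("pages", int(match.group(1)))
    match = RUNTIME.match(value)
    if match:
        hours, minutes, seconds = map(int, match.groups())
        return ("seconds", hours * 3600 + minutes * 60 + seconds)
    raise ValueError(f"unrecognised LEN value: {value!r}")
```

Fractions of a second are accepted but dropped in this sketch.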

SIZE

Physical dimensions of the medium, like page format or pixel size.

MARC: 300__c // DE: UCA_SIZE

Examples:

SIZE    1024x768px
SIZE    210x297mm

TST

Only in exported data: Timestamp of last change. Gets automatically updated, manual changes are ignored.

MARC: 005 // DE: UCA_TIMESTAMP

Example:

TST     20131130221300.0
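
The timestamp format yyyymmddhhmmss.0 maps directly onto Python's datetime module. A small sketch with hypothetical function names:

```python
from datetime import datetime

def parse_tst(value):
    """Parse a TST value such as '20131130221300.0' into a datetime."""
    return datetime.strptime(value.strip(), "%Y%m%d%H%M%S.%f")

def format_tst(moment):
    """Format a datetime as yyyymmddhhmmss.0 (fractions are always written as .0)."""
    return moment.strftime("%Y%m%d%H%M%S") + ".0"
```

The example above parses to 30 November 2013, 22:13:00, and format_tst turns that value back into the same string.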
