[ https://issues.apache.org/jira/browse/TIKA-4466?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18014520#comment-18014520 ]
Peter Hoogendijk commented on TIKA-4466:
----------------------------------------

[~tallison] - Dublin Core has been designed with minimal constraints to allow specific implementations the flexibility they require (like epub).

So I think there are three choices here:
 * Change only the fields we know are actually being used with multiple values, currently "identifier" as reported by [~zikasak].
 * Change all fields that can be used with multiple values according to the specific specs. But at this moment I only know the specs for "epub".
 * Follow the minimal constraints of the Dublin Core specs and allow all simple fields to have multiple values, even if this is not logical for a specific field (like "created").

As a user of Apache Tika, I'm happily leaving this choice to you and the other developers of this parser. But I'm happy you asked, as I can now adjust my code to be flexible and detect any future changes automatically.

FYI: I'm currently using Tika 3.2.2 but I'll certainly be testing this with future snapshots.

> OPFParser: Only the last dc:identifier is parsed, while multiple are valid.
> ---------------------------------------------------------------------------
>
>                 Key: TIKA-4466
>                 URL: https://issues.apache.org/jira/browse/TIKA-4466
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 3.2.2
>            Reporter: Grigorii Ioffe
>            Priority: Major
>         Attachments: image-2025-08-15-10-35-10-476.png, test_file.epub
>
>
> I have an ePub file with metadata stored in an OPF file with multiple
> dc:identifier fields. During parsing, however, OPFParser extracts only the
> last one.
> For example, if an OPF file inside an ePub contains these dc:identifier entries:
> {code:java}
> <dc:identifier>isbn:9780765350381</dc:identifier>
> <dc:identifier>mobi-asin:JD4PTHPBGIAQYZUBFUU3VFPVEUKY7S3U</dc:identifier>
> <dc:identifier>amazon:0765350386</dc:identifier>
> <dc:identifier>goodreads:243272</dc:identifier>
> <dc:identifier>calibre:55</dc:identifier>
> <dc:identifier>uuid:7dcb83b5-7364-4e29-9e5c-1d7b966a3595</dc:identifier>
> <dc:identifier id="uuid_id">uuid:7dcb83b5-7364-4e29-9e5c-1d7b966a3595</dc:identifier>
> {code}
> only uuid:7dcb83b5-7364-4e29-9e5c-1d7b966a3595 will be in the parsed metadata.
> According to the spec this is a valid situation, as identifier is marked as
> repeatable:
> [https://www.w3.org/TR/epub-33/#sec-opf-dcidentifier]
> My investigation showed that the field is created with PropertyType.SIMPLE
> here:
> `org.apache.tika.metadata/DublinCore.class:60`
> As a result, the check at
> `org.apache.tika.metadata/Property.class:272`
> returns false, so each entry overwrites the previously stored value
> instead of being added to an array.
>
> This is also not the only field with an incorrect type definition. It looks
> like title, language, description and some other fields are also defined
> incorrectly (or at least parsed incorrectly in OPFParser and DcXMLParser).
>

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
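
For anyone who needs every identifier regardless of how the property ends up being typed, the following is a minimal sketch that reads the OPF XML directly with the JDK's DOM parser and collects each dc:identifier in document order. It does not use Tika at all. The class name OpfIdentifiers and its identifiers method are hypothetical, and the sample input is a shortened version of the entries quoted in the issue.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.XMLConstants;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.SAXException;

// Hypothetical helper, not part of Tika: collects every dc:identifier from OPF XML.
final class OpfIdentifiers {
    // Namespace URI of the Dublin Core element set used by OPF metadata.
    static final String DC_NS = "http://purl.org/dc/elements/1.1/";

    static List<String> identifiers(String opfXml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true); // needed for getElementsByTagNameNS
            factory.setFeature(XMLConstants.FEATURE_SECURE_PROCESSING, true);
            Document doc = factory.newDocumentBuilder()
                    .parse(new ByteArrayInputStream(opfXml.getBytes(StandardCharsets.UTF_8)));
            NodeList nodes = doc.getElementsByTagNameNS(DC_NS, "identifier");
            // Append each value instead of overwriting, so repeated elements all survive.
            List<String> values = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                values.add(nodes.item(i).getTextContent().trim());
            }
            return values;
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalArgumentException("Could not parse OPF XML", e);
        }
    }

    public static void main(String[] args) {
        String opf = "<package xmlns=\"http://www.idpf.org/2007/opf\">"
                + "<metadata xmlns:dc=\"http://purl.org/dc/elements/1.1/\">"
                + "<dc:identifier>isbn:9780765350381</dc:identifier>"
                + "<dc:identifier>calibre:55</dc:identifier>"
                + "<dc:identifier id=\"uuid_id\">uuid:7dcb83b5-7364-4e29-9e5c-1d7b966a3595</dc:identifier>"
                + "</metadata></package>";
        identifiers(opf).forEach(System.out::println);
    }
}
```

The list keeps duplicates (the issue's sample repeats the uuid value); callers who want unique values could copy the result into a LinkedHashSet.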