[jira] [Commented] (PDFBOX-5128) Support parsing non standardized XMP
[ https://issues.apache.org/jira/browse/PDFBOX-5128?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17305747#comment-17305747 ]

ASF subversion and git services commented on PDFBOX-5128:

Commit 1887908 from Maruan Sahyoun in branch 'pdfbox/trunk'
[ https://svn.apache.org/r1887908 ]

PDFBOX-5128: add missing license header

> Support parsing non standardized XMP
>
> Key: PDFBOX-5128
> URL: https://issues.apache.org/jira/browse/PDFBOX-5128
> Project: PDFBox
> Issue Type: Task
> Components: XmpBox
> Reporter: Maruan Sahyoun
> Assignee: Maruan Sahyoun
> Priority: Major
> Attachments: PDFBOX.zip, image-2021-03-17-09-00-57-653.png
>
> XMP currently only supports parsing known XMP schema as has been discussed.
> That shall be extended to support arbitrary but valid XMP.

--
This message was sent by Atlassian Jira (v8.3.4#803005)

To unsubscribe, e-mail: dev-unsubscr...@pdfbox.apache.org
For additional commands, e-mail: dev-h...@pdfbox.apache.org
[jira] [Commented] (PDFBOX-5128) Support parsing non standardized XMP
[ https://issues.apache.org/jira/browse/PDFBOX-5128?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17305746#comment-17305746 ]

Maruan Sahyoun commented on PDFBOX-5128:

The Prism part of the XMP in PDFBOX-3440 no longer fails. For now I've only added it in trunk, and it only works if {{strictMode}} is {{false}}. When {{strictMode}} is {{true}} (the default), a defined XMPSchema with a matching class is expected to exist. The benefit is that, because there is a formal description of the schema, parsing gives a better result. For now the parsing doesn't detect different field types, so most fields are treated as text type.
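To make the "most fields are treated as text" behaviour concrete, here is a minimal sketch of schema-agnostic XMP reading. It is not the XmpBox implementation from the commit: it uses only the JDK's DOM parser, and the class name {{LenientXmpSketch}} and method {{readProperties}} are hypothetical names chosen for illustration. It assumes the input is well-formed XML and records every property of every {{rdf:Description}} as plain text, keyed by namespace and local name.

```java
import java.io.ByteArrayInputStream;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.XMLConstants;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NamedNodeMap;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

/** Hypothetical illustration only; not part of XmpBox. */
public class LenientXmpSketch {
    static final String RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#";

    /**
     * Reads every property of every rdf:Description without consulting a schema.
     * Keys have the form "{namespaceURI}localName"; all values are kept as text.
     */
    static Map<String, String> readProperties(byte[] xmp) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            factory.setFeature(XMLConstants.FEATURE_SECURE_PROCESSING, true);
            Document doc = factory.newDocumentBuilder().parse(new ByteArrayInputStream(xmp));

            Map<String, String> result = new LinkedHashMap<>();
            NodeList descriptions = doc.getElementsByTagNameNS(RDF_NS, "Description");
            for (int i = 0; i < descriptions.getLength(); i++) {
                Element description = (Element) descriptions.item(i);

                // Properties written in attribute form, e.g. prism:pageCount="12".
                NamedNodeMap attrs = description.getAttributes();
                for (int a = 0; a < attrs.getLength(); a++) {
                    Node attr = attrs.item(a);
                    String ns = attr.getNamespaceURI();
                    // Skip rdf:about, namespace declarations and unqualified attributes.
                    if (ns == null || RDF_NS.equals(ns) || attr.getNodeName().startsWith("xmlns")) {
                        continue;
                    }
                    result.put("{" + ns + "}" + attr.getLocalName(), attr.getNodeValue());
                }

                // Properties written in element form. No type detection is attempted:
                // arrays and structures are flattened to their concatenated text.
                for (Node child = description.getFirstChild(); child != null; child = child.getNextSibling()) {
                    if (child.getNodeType() == Node.ELEMENT_NODE) {
                        result.put("{" + child.getNamespaceURI() + "}" + child.getLocalName(),
                                child.getTextContent().trim());
                    }
                }
            }
            return result;
        } catch (Exception e) {
            throw new IllegalArgumentException("Input is not well-formed XMP", e);
        }
    }
}
```

Calling {{LenientXmpSketch.readProperties(bytes)}} on a packet that contains Prism properties returns them as strings even though no Prism schema class is registered. A schema-aware parser could instead map a value such as a page count to an integer type.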
[jira] [Commented] (PDFBOX-5128) Support parsing non standardized XMP
[ https://issues.apache.org/jira/browse/PDFBOX-5128?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17305744#comment-17305744 ]

ASF subversion and git services commented on PDFBOX-5128:

Commit 1887907 from Maruan Sahyoun in branch 'pdfbox/trunk'
[ https://svn.apache.org/r1887907 ]

PDFBOX-5128: initial support for parsing arbritary XMPs
[jira] [Commented] (PDFBOX-5128) Support parsing non standardized XMP
[ https://issues.apache.org/jira/browse/PDFBOX-5128?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17303529#comment-17303529 ]

Maruan Sahyoun commented on PDFBOX-5128:

Thank you for providing the files. I will try to add handling for some non-standard files first and then run against the test bed. That is not likely to happen before early next week.
[jira] [Commented] (PDFBOX-5128) Support parsing non standardized XMP
[ https://issues.apache.org/jira/browse/PDFBOX-5128?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17303522#comment-17303522 ]

Tim Allison commented on PDFBOX-5128:

The process hasn't finished, but I'm dumping the files here: [https://corpora.tika.apache.org/base/xmps/]

I'm roughly binning them by the file type of the container file, including: [https://corpora.tika.apache.org/base/xmps/pdf/]

Let me know if I can do any processing on these or if I botched the extraction.
[jira] [Commented] (PDFBOX-5128) Support parsing non standardized XMP
[ https://issues.apache.org/jira/browse/PDFBOX-5128?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17303391#comment-17303391 ]

Tim Allison commented on PDFBOX-5128:

Side note: I'm looking at the EOFs for my XMP byte scanner, and I notice that Oracle Outside In (at least back in 2011) didn't include a closing packet (see PDFBOX-1192).

!image-2021-03-17-09-00-57-653.png!
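The missing closing packet matters to any byte scanner that cuts XMP out of a container file. The sketch below is not the scanner mentioned in the comment; it is a hypothetical, JDK-only illustration ({{XmpPacketScanSketch}} and {{extractFirstPacket}} are made-up names). It looks for the {{<?xpacket begin=}} header, prefers the {{<?xpacket end=}} trailer, and falls back to the closing {{</x:xmpmeta>}} tag when the trailer is absent. It assumes a single-byte-compatible encoding such as UTF-8 and the conventional {{x}} prefix, so UTF-16 packets or other prefixes would not be found.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

/** Hypothetical illustration only; assumes UTF-8/ASCII-compatible packet bytes. */
public class XmpPacketScanSketch {
    private static final byte[] BEGIN = "<?xpacket begin=".getBytes(StandardCharsets.US_ASCII);
    private static final byte[] END = "<?xpacket end=".getBytes(StandardCharsets.US_ASCII);
    private static final byte[] PI_CLOSE = "?>".getBytes(StandardCharsets.US_ASCII);
    private static final byte[] META_CLOSE = "</x:xmpmeta>".getBytes(StandardCharsets.US_ASCII);

    /** Naive byte search; returns the index of the first match at or after 'from', or -1. */
    static int indexOf(byte[] data, byte[] pattern, int from) {
        for (int i = Math.max(from, 0); i <= data.length - pattern.length; i++) {
            int j = 0;
            while (j < pattern.length && data[i + j] == pattern[j]) {
                j++;
            }
            if (j == pattern.length) {
                return i;
            }
        }
        return -1;
    }

    /**
     * Returns the first packet including its envelope, or null if no packet header is found.
     * If the closing xpacket instruction is missing, the packet is cut after </x:xmpmeta>;
     * if that is missing too, everything up to the end of the data is returned.
     */
    static byte[] extractFirstPacket(byte[] data) {
        int start = indexOf(data, BEGIN, 0);
        if (start < 0) {
            return null;
        }
        int stop;
        int trailer = indexOf(data, END, start);
        if (trailer >= 0) {
            int close = indexOf(data, PI_CLOSE, trailer);
            stop = close >= 0 ? close + PI_CLOSE.length : data.length;
        } else {
            int meta = indexOf(data, META_CLOSE, start);
            stop = meta >= 0 ? meta + META_CLOSE.length : data.length;
        }
        return Arrays.copyOfRange(data, start, stop);
    }
}
```

Keeping the envelope in the returned bytes means a later parser still sees the packet header, and a truncated packet can be recognised by its missing trailer.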
[jira] [Commented] (PDFBOX-5128) Support parsing non standardized XMP
[ https://issues.apache.org/jira/browse/PDFBOX-5128?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17303266#comment-17303266 ]

Maruan Sahyoun commented on PDFBOX-5128:

[~tallison] yes, that's fine.

[~pwyatt] thanks for the information. I'll look into that as soon as I have the base stuff working.
[jira] [Commented] (PDFBOX-5128) Support parsing non standardized XMP
[ https://issues.apache.org/jira/browse/PDFBOX-5128?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17303064#comment-17303064 ]

Peter Wyatt commented on PDFBOX-5128:

And just FYI - very soon to be published by ISO is "ISO/DIS 16684-3 Graphic technology — Extensible metadata platform (XMP) specification — Part 3: JSON-LD serialization of XMP" (https://www.iso.org/standard/79384.html)
[jira] [Commented] (PDFBOX-5128) Support parsing non standardized XMP
[ https://issues.apache.org/jira/browse/PDFBOX-5128?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17303062#comment-17303062 ]

Peter Wyatt commented on PDFBOX-5128:

If you are testing with ZUGFeRD then please also test with Factur-X (French e-invoices). You can also find a few sample e-invoices and XMP extension schema at [https://www.pdflib.com/pdf-knowledge-base/zugferd-and-factur-x/].

Also note that the XMP ISO standard ISO 16684-1 was relatively recently updated and re-released in 2019 (see [https://www.iso.org/standard/75163.html]). This replaced the original 2012 edition. I'm not 100% sure of everything that changed, but I believe Rational was introduced as a data type and some data points can now be arrays...
[jira] [Commented] (PDFBOX-5128) Support parsing non standardized XMP
[ https://issues.apache.org/jira/browse/PDFBOX-5128?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17302946#comment-17302946 ]

Tim Allison commented on PDFBOX-5128:

[~msahyoun] ... does the attached look about right? If so, I'll run against our full corpus and mirror the directory structure.
[jira] [Commented] (PDFBOX-5128) Support parsing non standardized XMP
[ https://issues.apache.org/jira/browse/PDFBOX-5128?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17300418#comment-17300418 ]

beat weisskopf commented on PDFBOX-5128:

Maybe related: ZUGFeRD (for e-invoices) also uses a custom XMP schema. https://www.mustangproject.org/ is already based on PDFBox, so there might be some samples to be found there.
[jira] [Commented] (PDFBOX-5128) Support parsing non standardized XMP
[ https://issues.apache.org/jira/browse/PDFBOX-5128?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17300366#comment-17300366 ]

Maruan Sahyoun commented on PDFBOX-5128:

Yes, please
[jira] [Commented] (PDFBOX-5128) Support parsing non standardized XMP
[ https://issues.apache.org/jira/browse/PDFBOX-5128?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17300365#comment-17300365 ]

Tim Allison commented on PDFBOX-5128:

I'll scrape XMP out of our regression corpus. Should I retain the packet envelope?