[jira] [Commented] (PDFBOX-5128) Support parsing non standardized XMP

2021-03-21 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/PDFBOX-5128?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17305747#comment-17305747
 ] 

ASF subversion and git services commented on PDFBOX-5128:
-

Commit 1887908 from Maruan Sahyoun in branch 'pdfbox/trunk'
[ https://svn.apache.org/r1887908 ]

PDFBOX-5128: add missing license header

> Support parsing non standardized XMP 
> -
>
> Key: PDFBOX-5128
> URL: https://issues.apache.org/jira/browse/PDFBOX-5128
> Project: PDFBox
>  Issue Type: Task
>  Components: XmpBox
>Reporter: Maruan Sahyoun
>Assignee: Maruan Sahyoun
>Priority: Major
> Attachments: PDFBOX.zip, image-2021-03-17-09-00-57-653.png
>
>
> XmpBox currently only supports parsing known XMP schemas, as has been discussed.
> That shall be extended to support arbitrary but valid XMP.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: dev-unsubscr...@pdfbox.apache.org
For additional commands, e-mail: dev-h...@pdfbox.apache.org



[jira] [Commented] (PDFBOX-5128) Support parsing non standardized XMP

2021-03-21 Thread Maruan Sahyoun (Jira)


[ 
https://issues.apache.org/jira/browse/PDFBOX-5128?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17305746#comment-17305746
 ] 

Maruan Sahyoun commented on PDFBOX-5128:


The Prism part of the XMP in PDFBOX-3440 no longer fails. For now I've only 
added it in trunk, and it only works if {{strictMode}} is {{false}}. When 
{{strictMode}} is {{true}} (the default), a defined XMPSchema with a matching 
class is expected to exist. The benefit is that, since there is a formal 
description of the schema, the parsing provides a better result. 

For now the parsing doesn't detect different field types, and as a result most 
fields are treated as text type. 
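
To illustrate what schema-agnostic parsing with text-only typing can look like, here is a minimal sketch that uses only the JDK's DOM parser. It is not the XmpBox implementation and does not use the XmpBox API; the class name {{LenientXmpSketch}} and the method {{readProperties}} are hypothetical names chosen for this example. It collects every property found on an rdf:Description, whether or not its namespace belongs to a known schema, and keeps each value as plain text.

```java
import java.io.ByteArrayInputStream;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.XMLConstants;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NamedNodeMap;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

// Hypothetical helper for illustration only; not part of XmpBox.
final class LenientXmpSketch {
    static final String RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#";
    static final String XMLNS_NS = "http://www.w3.org/2000/xmlns/";

    /** Maps "{namespaceURI}localName" to the property value, always as text. */
    static Map<String, String> readProperties(byte[] xmp) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true);
        factory.setFeature(XMLConstants.FEATURE_SECURE_PROCESSING, true);
        Document doc = factory.newDocumentBuilder().parse(new ByteArrayInputStream(xmp));

        Map<String, String> result = new LinkedHashMap<>();
        NodeList descriptions = doc.getElementsByTagNameNS(RDF_NS, "Description");
        for (int i = 0; i < descriptions.getLength(); i++) {
            Element description = (Element) descriptions.item(i);

            // Properties written in attribute form, skipping rdf:* and xmlns:* attributes.
            NamedNodeMap attrs = description.getAttributes();
            for (int j = 0; j < attrs.getLength(); j++) {
                Node attr = attrs.item(j);
                String ns = attr.getNamespaceURI();
                if (ns != null && !RDF_NS.equals(ns) && !XMLNS_NS.equals(ns)) {
                    result.put("{" + ns + "}" + attr.getLocalName(), attr.getNodeValue());
                }
            }

            // Properties written as child elements. Structured values (bags, sequences,
            // structs) are flattened to their concatenated text, with no type detection.
            for (Node child = description.getFirstChild(); child != null;
                    child = child.getNextSibling()) {
                if (child.getNodeType() == Node.ELEMENT_NODE) {
                    result.put("{" + child.getNamespaceURI() + "}" + child.getLocalName(),
                            child.getTextContent().trim());
                }
            }
        }
        return result;
    }
}
```

A strict parser would instead look up each namespace in a registry of known schema classes and reject anything unknown, which is what makes typed results possible there.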




[jira] [Commented] (PDFBOX-5128) Support parsing non standardized XMP

2021-03-21 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/PDFBOX-5128?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17305744#comment-17305744
 ] 

ASF subversion and git services commented on PDFBOX-5128:
-

Commit 1887907 from Maruan Sahyoun in branch 'pdfbox/trunk'
[ https://svn.apache.org/r1887907 ]

PDFBOX-5128: initial support for parsing arbitrary XMPs




[jira] [Commented] (PDFBOX-5128) Support parsing non standardized XMP

2021-03-17 Thread Maruan Sahyoun (Jira)


[ 
https://issues.apache.org/jira/browse/PDFBOX-5128?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17303529#comment-17303529
 ] 

Maruan Sahyoun commented on PDFBOX-5128:


Thank you for providing the files. I will try to add handling for some 
non-standard files first and then run against the test bed. This is not likely 
before early next week.




[jira] [Commented] (PDFBOX-5128) Support parsing non standardized XMP

2021-03-17 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/PDFBOX-5128?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17303522#comment-17303522
 ] 

Tim Allison commented on PDFBOX-5128:
-

The process hasn't finished, but I'm dumping the files here:

[https://corpora.tika.apache.org/base/xmps/]

I'm roughly binning them by the file type of the container file, including: 
[https://corpora.tika.apache.org/base/xmps/pdf/] 

 

Let me know if I can do any processing on these or if I botched the extraction.

 




[jira] [Commented] (PDFBOX-5128) Support parsing non standardized XMP

2021-03-17 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/PDFBOX-5128?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17303391#comment-17303391
 ] 

Tim Allison commented on PDFBOX-5128:
-

Side note... I'm looking at the EOFs for my XMP byte scanner, and I notice that 
Oracle Outside In (at least back in 2011) didn't include a closing packet 
(see PDFBOX-1192).

!image-2021-03-17-09-00-57-653.png!
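
As a rough illustration of the problem, the following sketch shows one way a byte scanner could cope with a missing closing packet. It is not the scanner referred to above. The class {{XmpPacketScanner}} and its methods are hypothetical, and the sketch assumes an 8-bit encoding such as UTF-8 and the conventional {{x:}} prefix on the xmpmeta element. It looks for the {{<?xpacket begin=}} header, prefers the {{<?xpacket end=}} trailer when present, and otherwise falls back to the end of the {{x:xmpmeta}} element.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical helper for illustration only.
final class XmpPacketScanner {
    private static final byte[] BEGIN = ascii("<?xpacket begin=");
    private static final byte[] END = ascii("<?xpacket end=");
    private static final byte[] PI_CLOSE = ascii("?>");
    // Assumption: the wrapper element uses the conventional "x" prefix.
    private static final byte[] META_CLOSE = ascii("</x:xmpmeta>");

    private static byte[] ascii(String s) {
        return s.getBytes(StandardCharsets.US_ASCII);
    }

    /** Naive byte search; returns the index of the first match at or after 'from', or -1. */
    static int indexOf(byte[] data, byte[] pattern, int from) {
        outer:
        for (int i = Math.max(from, 0); i <= data.length - pattern.length; i++) {
            for (int j = 0; j < pattern.length; j++) {
                if (data[i + j] != pattern[j]) {
                    continue outer;
                }
            }
            return i;
        }
        return -1;
    }

    /** Returns the first packet including its envelope, or null if no header is found. */
    static byte[] firstPacket(byte[] data) {
        int start = indexOf(data, BEGIN, 0);
        if (start < 0) {
            return null;
        }
        // Normal case: the packet is closed by an xpacket trailer.
        int trailer = indexOf(data, END, start);
        if (trailer >= 0) {
            int close = indexOf(data, PI_CLOSE, trailer);
            if (close >= 0) {
                return Arrays.copyOfRange(data, start, close + PI_CLOSE.length);
            }
        }
        // No trailer: stop at the end of x:xmpmeta, or at end of input as a last resort.
        int meta = indexOf(data, META_CLOSE, start);
        int end = meta >= 0 ? meta + META_CLOSE.length : data.length;
        return Arrays.copyOfRange(data, start, end);
    }
}
```

The sketch only returns the first packet and would mis-handle a file in which a trailer-less packet is followed by a second, complete packet; a more careful scanner would also check for a second header before accepting a trailer.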




[jira] [Commented] (PDFBOX-5128) Support parsing non standardized XMP

2021-03-17 Thread Maruan Sahyoun (Jira)


[ 
https://issues.apache.org/jira/browse/PDFBOX-5128?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17303266#comment-17303266
 ] 

Maruan Sahyoun commented on PDFBOX-5128:


[~tallison] yes, that's fine.
[~pwyatt] thanks for the information. I'll look into that as soon as I have 
the base stuff working.




[jira] [Commented] (PDFBOX-5128) Support parsing non standardized XMP

2021-03-16 Thread Peter Wyatt (Jira)


[ 
https://issues.apache.org/jira/browse/PDFBOX-5128?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17303064#comment-17303064
 ] 

Peter Wyatt commented on PDFBOX-5128:
-

And just FYI - very soon to be published by ISO is "ISO/DIS 16684-3 Graphic 
technology — Extensible metadata platform (XMP) specification — Part 3: JSON-LD 
serialization of XMP" (https://www.iso.org/standard/79384.html)




[jira] [Commented] (PDFBOX-5128) Support parsing non standardized XMP

2021-03-16 Thread Peter Wyatt (Jira)


[ 
https://issues.apache.org/jira/browse/PDFBOX-5128?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17303062#comment-17303062
 ] 

Peter Wyatt commented on PDFBOX-5128:
-

If you are testing with ZUGFeRD then please also test with Factur-X (French 
e-invoices). 
You can also find a few sample e-invoices and XMP extension schemas at 
https://www.pdflib.com/pdf-knowledge-base/zugferd-and-factur-x/


Also note that the XMP ISO standard ISO 16684-1 was relatively recently 
updated and re-released in 2019 (see https://www.iso.org/standard/75163.html). 
This replaced the original 2012 edition. I'm not 100% sure of everything that 
changed, but I believe Rational was introduced as a data type and some data 
points can now be arrays...




[jira] [Commented] (PDFBOX-5128) Support parsing non standardized XMP

2021-03-16 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/PDFBOX-5128?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17302946#comment-17302946
 ] 

Tim Allison commented on PDFBOX-5128:
-

[~msahyoun] ... does the attached look about right?  If so, I'll run against 
our full corpus and mirror the directory structure.




[jira] [Commented] (PDFBOX-5128) Support parsing non standardized XMP

2021-03-12 Thread beat weisskopf (Jira)


[ 
https://issues.apache.org/jira/browse/PDFBOX-5128?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17300418#comment-17300418
 ] 

beat weisskopf commented on PDFBOX-5128:


Maybe related: ZUGFeRD (for e-invoices) also uses a custom XMP schema. 
https://www.mustangproject.org/ is already based on PDFBox, so there might be 
some samples to be found there.




[jira] [Commented] (PDFBOX-5128) Support parsing non standardized XMP

2021-03-12 Thread Maruan Sahyoun (Jira)


[ 
https://issues.apache.org/jira/browse/PDFBOX-5128?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17300366#comment-17300366
 ] 

Maruan Sahyoun commented on PDFBOX-5128:


Yes, please




[jira] [Commented] (PDFBOX-5128) Support parsing non standardized XMP

2021-03-12 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/PDFBOX-5128?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17300365#comment-17300365
 ] 

Tim Allison commented on PDFBOX-5128:
-

I'll scrape the XMP out of our regression corpus. Should I retain the packet 
envelope?
