[ 
https://issues.apache.org/jira/browse/TIKA-3446?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17455012#comment-17455012
 ] 

ASF GitHub Bot commented on TIKA-3446:
--------------------------------------

nddipiazza opened a new pull request #461:
URL: https://github.com/apache/tika/pull/461


   # Support parsing OneNote files when downloaded from O365
   
   Previous versions of the Tika OneNote parser could not handle files saved 
from Office 365 (SharePoint Online, OneDrive).
   
   See section 2.8 of this document
   
https://interoperability.blob.core.windows.net/files/MS-ONESTORE/%5bMS-ONESTORE%5d.pdf
   
   which explains that MS-ONESTORE documents can be encoded according to the 
following spec: 
   
https://interoperability.blob.core.windows.net/files/MS-FSSHTTPB/%5bMS-FSSHTTPB%5d.pdf
   
   With this change, users who get files from the O365 suite will be able to 
use the OneNote parser. 
   
   # Things to improve later
   
   * Stream the input instead of reading it into a byte array?
   * See if we can use this newer parser code for the on-prem documents too, 
to avoid code bloat?
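
   As an illustration of the first item, here is a minimal sketch of the 
difference between buffering a whole input and processing it in chunks. It 
is not taken from the Tika code base: the class `StreamVsBuffer` and both 
method names are hypothetical, and counting non-zero bytes merely stands in 
for real parsing work. Whether the OneNote parser could work this way depends 
on whether it needs random access to the data, which this sketch does not 
address.

```java
import java.io.IOException;
import java.io.InputStream;

// Hypothetical example class; not part of Apache Tika.
class StreamVsBuffer {

    // Buffered approach: the whole input is held in memory at once,
    // so memory use grows with the size of the file.
    static long countNonZeroBuffered(InputStream in) throws IOException {
        byte[] all = in.readAllBytes(); // requires Java 9 or later
        long count = 0;
        for (byte b : all) {
            if (b != 0) {
                count++;
            }
        }
        return count;
    }

    // Streaming approach: only one fixed-size chunk is held in memory,
    // regardless of how large the input is.
    static long countNonZeroStreaming(InputStream in) throws IOException {
        byte[] chunk = new byte[8192];
        long count = 0;
        int n;
        while ((n = in.read(chunk)) != -1) {
            for (int i = 0; i < n; i++) {
                if (chunk[i] != 0) {
                    count++;
                }
            }
        }
        return count;
    }
}
```

   Both methods return the same count for the same input; only the peak 
memory use differs.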
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


> OneNote - look into adding support for OneNote 365 documents
> ------------------------------------------------------------
>
>                 Key: TIKA-3446
>                 URL: https://issues.apache.org/jira/browse/TIKA-3446
>             Project: Tika
>          Issue Type: New Feature
>          Components: parser
>    Affects Versions: 1.27
>            Reporter: Nicholas DiPiazza
>            Assignee: Nicholas DiPiazza
>            Priority: Major
>
> While doing some parsing of OneNote documents, I was investigating a slew of 
> them that did not seem to parse very well. 
> When I did some digging, I found out that these documents were generated from 
> SharePoint Online. 
> I had hoped that OneNote documents generated from SharePoint Online would 
> just be the same as OnPrem OneNote documents from 2016, 2019 etc. 
> But it turns out this is NOT the case. 
> I checked out the Microsoft specification MS-ONESTORE and found that the 
> documents do not match the specifications that are published. 
> Opened a community post: [Looking for the MS spec for OneNote 365 version - 
> Microsoft 
> Q&A|https://docs.microsoft.com/en-us/answers/questions/436943/looking-for-the-ms-spec-for-onenote-365-version-1.html]
> And also opened an internal ticket with Microsoft. 
> They will be responding soon with an analysis of my issue and we'll see if 
> there is anything we can do. 



--
This message was sent by Atlassian Jira
(v8.20.1#820001)
