[ 
https://issues.apache.org/jira/browse/TIKA-4088?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gregory Lepore updated TIKA-4088:
---------------------------------
    Description: 
The SEG Y format occurs roughly 2,390 times in the latest Common Crawl 
dataset. It has no known MIME type. The magic is:

Offset 0: C340(F1|40)40

Offset 80: C3

Offset 160: C3

 

An additional C3 follows every 80 bytes, 38 more times. However, the three 
checks above matched all SEG Y files in my test collections, with no false 
positives, so they should be good enough.
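
As a rough illustration only, the three checks above could be expressed in Python as below. This is a sketch, not how Tika itself would declare the magic (Tika normally takes such patterns from its MIME type definitions). The function name `looks_like_segy` and the sample header text are hypothetical, and the sketch covers only the three offsets listed, not the further C3 bytes.

```python
import re

# Offset 0 pattern from the description: C3 40 (F1|40) 40.
# In EBCDIC these bytes read as "C", space, then "1" or space, then space.
HEADER_START = re.compile(rb"\xC3\x40[\xF1\x40]\x40")


def looks_like_segy(data: bytes) -> bool:
    """Apply the three magic checks (offsets 0, 80 and 160) to a byte buffer."""
    if len(data) < 161:
        # Not enough bytes to inspect offset 160.
        return False
    return (
        HEADER_START.match(data) is not None
        and data[80] == 0xC3
        and data[160] == 0xC3
    )


# Hypothetical sample: one 80-byte EBCDIC line repeated 40 times.
sample_line = "C 1 CLIENT".ljust(80).encode("cp037")
sample_header = sample_line * 40
```

Calling `looks_like_segy(sample_header)` on the synthetic header exercises all three offsets. A real detector would read only the first few hundred bytes of the file rather than the whole buffer.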

 

File extensions are .segy and .sgy.

 

[https://web.archive.org/web/20160312030348/http://www.seg.org/resources/publications/misc/technical-standards]

 

A different signature is listed at:

 

[https://www.nationalarchives.gov.uk/PRONOM/Format/proFormatSearch.aspx?status=detailReport&id=1110&strPageToDisplay=signatures]

  was:
The SEG Y format occurs 2,390 times (roughly) in the latest Common Crawl 
dataset. No known mime type. Magic is:

Offset 0: C340(F1|40)40

Offset 80: C3

Offset 160: C3

 

With additional C3 every 80 bytes 38 more times. However, the above matched all 
SEG Y files in my test collections, with no false positives, so it should be 
good enough.

 

File extension is .segy.

 

[https://web.archive.org/web/20160312030348/http://www.seg.org/resources/publications/misc/technical-standards]

 

A different signature at:

 

https://www.nationalarchives.gov.uk/PRONOM/Format/proFormatSearch.aspx?status=detailReport&id=1110&strPageToDisplay=signatures


> Add magic for SEG Y format
> --------------------------
>
>                 Key: TIKA-4088
>                 URL: https://issues.apache.org/jira/browse/TIKA-4088
>             Project: Tika
>          Issue Type: Sub-task
>            Reporter: Gregory Lepore
>            Priority: Minor
>         Attachments: 
> 0b518f422c100574e3ae8963842bd18bdb4ad27254022fcb6dafc7fe1b7d366c, 
> 38a58826f54edafebab0bb4381f2b39f7b9b6c04fbba3c948f18fe6861bd14d1, 
> 79d81ddaad3582d71b596c819e0c61d43b092f8dbafb1d4400199673cfce0a8f
>
>
> The SEG Y format occurs roughly 2,390 times in the latest Common Crawl 
> dataset. It has no known MIME type. The magic is:
> Offset 0: C340(F1|40)40
> Offset 80: C3
> Offset 160: C3
>  
> An additional C3 follows every 80 bytes, 38 more times. However, the three 
> checks above matched all SEG Y files in my test collections, with no false 
> positives, so they should be good enough.
>  
> File extensions are .segy and .sgy.
>  
> [https://web.archive.org/web/20160312030348/http://www.seg.org/resources/publications/misc/technical-standards]
>  
> A different signature is listed at:
>  
> [https://www.nationalarchives.gov.uk/PRONOM/Format/proFormatSearch.aspx?status=detailReport&id=1110&strPageToDisplay=signatures]



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
