Re: [PR] Bump aws.version from 1.12.656 to 1.12.657 [tika]

2024-02-12 Thread via GitHub


THausherr merged PR #1592:
URL: https://github.com/apache/tika/pull/1592


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscr...@tika.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[PR] Bump aws.version from 1.12.656 to 1.12.657 [tika]

2024-02-12 Thread via GitHub


dependabot[bot] opened a new pull request, #1592:
URL: https://github.com/apache/tika/pull/1592

   Bumps `aws.version` from 1.12.656 to 1.12.657.
   Updates `com.amazonaws:aws-java-sdk-s3` from 1.12.656 to 1.12.657
   
   Changelog
   Sourced from com.amazonaws:aws-java-sdk-s3's changelog 
   (https://github.com/aws/aws-sdk-java/blob/master/CHANGELOG.md).
   
   1.12.657 2024-02-12
   
   AWS AppSync
   Features
   - Adds support for new options on GraphqlAPIs, Resolvers and Data Sources 
     for emitting Amazon CloudWatch metrics for enhanced monitoring of AppSync 
     APIs.
   
   Amazon CloudWatch
   Features
   - This release enables PutMetricData API request payload compression by 
     default.
   
   Amazon Neptune Graph
   Features
   - Adding a new option parameters for data plane api ExecuteQuery to 
     support running parameterized query via SDK.
   
   Amazon Route 53 Domains
   Features
   - This release adds bill contact support for RegisterDomain, 
     TransferDomain, UpdateDomainContact and GetDomainDetail API.
   
   Commits
   - 3d1c7de AWS SDK for Java 1.12.657 
     (https://github.com/aws/aws-sdk-java/commit/3d1c7de71d5fbb74d542f6634778d61254ba0667)
   - 4ce0ae3 Update GitHub version number to 1.12.657-SNAPSHOT 
     (https://github.com/aws/aws-sdk-java/commit/4ce0ae3633f217e6b461cfd03b76e206782058a6)
   - See full diff in compare view: 
     https://github.com/aws/aws-sdk-java/compare/1.12.656...1.12.657
   
   Updates `com.amazonaws:aws-java-sdk-transcribe` from 1.12.656 to 1.12.657
   
   Dependabot will resolve any conflicts with this PR as long as you don't 
alter it yourself. You can also trigger a rebase manually by commenting 
`@dependabot rebase`.
   
   ---
   
   
   Dependabot commands and options
   
   
   You can trigger Dependabot actions by commenting on this PR:
   - `@dependabot rebase` will rebase this PR
   - `@dependabot recreate` will recreate this PR, overwriting any edits that 
have been made to it
   - `@dependabot merge` will merge this PR after your CI passes on it
   - `@dependabot squash and merge` will squash and merge this PR after your CI 
passes on it
   - `@dependabot cancel merge` will cancel a previously requested merge and 
block automerging
   - `@dependabot reopen` will reopen this PR if it is closed
   - `@dependabot close` will close this PR and stop Dependabot recreating it. 
You can achieve the same result by closing it manually
   - `@dependabot show <dependency name> ignore conditions` will show all of 
the ignore conditions of the specified dependency
   - `@dependabot ignore this major version` will close this PR and stop 
Dependabot creating any more for this major version (unless you reopen the PR 
or upgrade to it yourself)
   - `@dependabot ignore this minor version` will close this PR and stop 
Dependabot creating any more for this minor version (unless you reopen the PR 
or upgrade to it yourself)
   - `@dependabot ignore this dependency` will close this PR and stop 
Dependabot creating any more for this dependency (unless you reopen the PR or 
upgrade to it yourself)
   
   
   





[jira] [Commented] (TIKA-3784) Detector returns "application/x-x509-key" when scanning a .p12 file

2024-02-12 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3784?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17816827#comment-17816827
 ] 

Tim Allison commented on TIKA-3784:
---

Well, sure, if you want to make it easy! Y, let's go with something like that!

I'll see what I can do tomorrow. Thank you!

> Detector returns "application/x-x509-key" when scanning a .p12 file
> ---
>
> Key: TIKA-3784
> URL: https://issues.apache.org/jira/browse/TIKA-3784
> Project: Tika
>  Issue Type: Bug
>  Components: detector
>Affects Versions: 1.26
>Reporter: Matthias Hofbauer
>Priority: Critical
> Attachments: dump_p12s.txt
>
>
> We are using tika to check if the MIME type of the file extensions matches 
> with the MIME type of the file content.
> After our upgrade from tika-core 1.22 to 1.26 our logic does not work anymore 
> for certificates of type .p12, .pfx, .cer, .der.
> For the .p12 and .pfx extension the MIME type is "application/x-pkcs12" but 
> the tika detector returns "application/x-x509-key" instead.
> After checking the tika-mimetype.xml and comparing it to my .p12 file I found 
> the following MIME magic which explains why I got these types back.
> {code:xml}
> 
>     
>     
>     
>     
>     
>                      mask="0x00FC" offset="0"/>
>                      mask="0xFC" offset="0"/>
>     
>  {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Comment Edited] (TIKA-3784) Detector returns "application/x-x509-key" when scanning a .p12 file

2024-02-12 Thread Lonzak (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3784?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17816811#comment-17816811
 ] 

Lonzak edited comment on TIKA-3784 at 2/12/24 11:16 PM:


PKCS12 is not the easiest format :-|

The oid for pkcs12 starts with "1.2.840.113549.1.12"

I decoded one pkcs12 example (from redhat) and got the following:
{code:java}

 3
 
  1.2.840.113549.1.7.1
  
   
    
     
      
       1.2.840.113549.1.7.1
       
        
         
          
           
            1.2.840.113549.1.12.10.1.2
            
             
              
               1.2.840.113549.1.12.1.3
               
                0xC8CCE579B6DE5B393F7C4885714C04BA
                2000
               
              
              0x...(shortened for readability)
             
            
            
             
              1.2.840.113549.1.9.20
              
               0x00630061
              
             
             
              1.2.840.113549.1.9.21
              
               
0x0CDA92EB395D4697A9D178352AF6B2BF06947888
              
             
            
           
          
         
        
       
      
      
       1.2.840.113549.1.7.6
       
        
         
         
          1.2.840.113549.1.7.1
          
           1.2.840.113549.1.12.1.6
           
            0x7F432D60BCD2888476E6CB9CD2BC69F1
            2000
           
          
          
           0x...(shortened for readability)
           0x0E8E4C15DCB1D87F
          
         
        
       
      
     
    
   
  
 
 
  
   
    1.3.14.3.2.26
    
   
   0x6DFFA14B5A8A32A87DAD2CFCE1EAEBDAFB89C897
  
  0x826699C21B9A4C9E3E608D3C8FBD2310
  2000
 
 {code}
The OID appears three times in there, among others. The following things 
point to the PKCS12 format:
 # Presence of PKCS#12-specific object identifiers (OIDs):
 ## PKCS#12 Bag Types: presence of OIDs such as 1.2.840.113549.1.12.10.1.x, 
which indicate different types of key and certificate bags (KeyBags, CertBags, 
etc.).
 ## PKCS#12 PbeIds: Encryption and hashing OIDs such as 
1.2.840.113549.1.12.1.x, which indicate the use of specific encryption 
mechanisms
 # Use of encryption schemes:
 ## Recognize encryption schemes, especially those that are typical for 
PKCS#12, such as pbeWithSHAAnd3-KeyTripleDES-CBC and pbeWithSHAAnd40BitRC2-CBC. 
These schemes are crucial for the security of PKCS#12 files and a clear 
indication of their presence.
 #  Structure of the file:
 ## Analyzing the file structure for multi-level nested SEQUENCE and 
OCTET_STRING elements, which are typically used to store encrypted private keys 
and certificates. The complexity of this structure is characteristic of PKCS#12 
files.
 # Specific attributes:
 ## PKCS#9 attributes such as friendlyName (OID 1.2.840.113549.1.9.20) and 
localKeyID (OID 1.2.840.113549.1.9.21) are commonly used to provide metadata 
for keys and certificates within the container.
 #  ... and more
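
The first indicator in the list above (PKCS#12-specific OIDs) can be sketched as a byte search. This is only an illustration: `Pkcs12OidScanner` and its methods are hypothetical names, not part of Tika, and the check is a heuristic rather than a parser. It encodes the dotted prefix 1.2.840.113549.1.12 into its DER content bytes and looks for them anywhere in the data.

```java
import java.io.ByteArrayOutputStream;

/** Hypothetical helper, not part of Tika: looks for the DER-encoded PKCS#12 OID prefix. */
final class Pkcs12OidScanner {

    /** DER content bytes of an OID given in dotted form, e.g. "1.2.840.113549.1.12". */
    static byte[] encodeOid(String dotted) {
        String[] arcs = dotted.split("\\.");
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        // The first two arcs share one value: 40 * first + second.
        writeBase128(out, 40L * Long.parseLong(arcs[0]) + Long.parseLong(arcs[1]));
        for (int i = 2; i < arcs.length; i++) {
            writeBase128(out, Long.parseLong(arcs[i]));
        }
        return out.toByteArray();
    }

    /** Writes one arc in base 128, with the high bit set on every byte except the last. */
    private static void writeBase128(ByteArrayOutputStream out, long value) {
        byte[] tmp = new byte[10];
        int pos = tmp.length;
        tmp[--pos] = (byte) (value & 0x7F);
        while ((value >>>= 7) != 0) {
            tmp[--pos] = (byte) ((value & 0x7F) | 0x80);
        }
        out.write(tmp, pos, tmp.length - pos);
    }

    /** Returns true if needle occurs anywhere in haystack (naive search). */
    static boolean contains(byte[] haystack, byte[] needle) {
        outer:
        for (int i = 0; i + needle.length <= haystack.length; i++) {
            for (int j = 0; j < needle.length; j++) {
                if (haystack[i + j] != needle[j]) {
                    continue outer;
                }
            }
            return true;
        }
        return false;
    }

    /** Heuristic only: true if the bytes contain an OID under 1.2.840.113549.1.12. */
    static boolean looksLikePkcs12(byte[] data) {
        return contains(data, encodeOid("1.2.840.113549.1.12"));
    }
}
```

A match is a hint, not proof, and a miss proves nothing: bag OIDs inside encrypted content are not visible, and files protected with other schemes may not expose any OID under this prefix.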

 

However, since we are already talking about libraries: standard Java crypto and 
BouncyCastle already have all of this built in. They parse the structures, 
analyze them and use them, so using one of these two would be the easiest 
solution imho. I have never written a Detector, so please excuse my ignorance:
{code:java}
import java.io.InputStream;
import java.security.KeyStore;

import org.apache.tika.detect.Detector;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.mime.MediaType;

public class PKCS12Detector implements Detector {

    private static final long serialVersionUID = -8414458255467101503L;
    private static final MediaType PKCS12_MEDIA_TYPE = MediaType.application("x-pkcs12");

    @Override
    public MediaType detect(InputStream input, Metadata metadata) {
        try {
            KeyStore keyStore = KeyStore.getInstance("PKCS12");
            // No password is given, so integrity checking is not performed.
            // Note: load() consumes the stream; it is not rewound here.
            keyStore.load(input, null);
            return PKCS12_MEDIA_TYPE; // success
        } catch (Exception e) {
            return MediaType.OCTET_STREAM; // something else
        }
    }
}
{code}
A Bouncy Castle version would look quite similar. Also, loading the keystore 
takes quite some time.
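
One thing the draft above leaves open is that `KeyStore.load` consumes the stream, while a detector is expected to leave the stream where it found it. A minimal sketch of the same probe with mark/reset follows. `Pkcs12Probe` and `MAX_PROBE_BYTES` are hypothetical names and values chosen for illustration, not Tika API, and the sketch assumes the caller passes a stream that supports marks.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.security.KeyStore;

/** Hypothetical helper, not Tika API: probes a stream with the JDK PKCS12 KeyStore. */
final class Pkcs12Probe {

    /** Assumed read limit for mark(); larger inputs may fail to rewind on some streams. */
    static final int MAX_PROBE_BYTES = 1 << 20;

    /**
     * Returns true if the JDK can load the stream as a PKCS12 keystore.
     * The stream is rewound to its original position before returning.
     */
    static boolean isPkcs12(InputStream input) {
        if (input == null || !input.markSupported()) {
            return false;
        }
        input.mark(MAX_PROBE_BYTES);
        boolean loaded;
        try {
            KeyStore keyStore = KeyStore.getInstance("PKCS12");
            // No password is given, so integrity checking is not performed.
            keyStore.load(input, null);
            loaded = true;
        } catch (Exception e) {
            loaded = false; // not loadable as PKCS12 by this JDK
        }
        try {
            input.reset();
        } catch (IOException e) {
            throw new UncheckedIOException("could not rewind stream", e);
        }
        return loaded;
    }
}
```

Depending on the JDK's keystore compatibility settings, the PKCS12 keystore type may also accept JKS files, so a positive result is best treated as a hint rather than a guarantee.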



[jira] [Comment Edited] (TIKA-3784) Detector returns "application/x-x509-key" when scanning a .p12 file

2024-02-12 Thread Lonzak (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3784?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17816811#comment-17816811
 ] 

Lonzak edited comment on TIKA-3784 at 2/12/24 11:15 PM:


PKCS12 is not the easiest format :-|

The oid for pkcs12 starts with "1.2.840.113549.1.12"

I decoded one pkcs12 example (from redhat) and got the following:
{code:java}

 3
 
  1.2.840.113549.1.7.1
  
   
    
     
      
       1.2.840.113549.1.7.1
       
        
         
          
           
            1.2.840.113549.1.12.10.1.2
            
             
              
               1.2.840.113549.1.12.1.3
               
                0xC8CCE579B6DE5B393F7C4885714C04BA
                2000
               
              
              0x...(shortened for readability)
             
            
            
             
              1.2.840.113549.1.9.20
              
               0x00630061
              
             
             
              1.2.840.113549.1.9.21
              
               
0x0CDA92EB395D4697A9D178352AF6B2BF06947888
              
             
            
           
          
         
        
       
      
      
       1.2.840.113549.1.7.6
       
        
         
         
          1.2.840.113549.1.7.1
          
           1.2.840.113549.1.12.1.6
           
            0x7F432D60BCD2888476E6CB9CD2BC69F1
            2000
           
          
          
           0x...(shortened for readability)
           0x0E8E4C15DCB1D87F
          
         
        
       
      
     
    
   
  
 
 
  
   
    1.3.14.3.2.26
    
   
   0x6DFFA14B5A8A32A87DAD2CFCE1EAEBDAFB89C897
  
  0x826699C21B9A4C9E3E608D3C8FBD2310
  2000
 
 {code}
The OID appears three times in there, among others. The following things 
point to the PKCS12 format:
 # Presence of PKCS#12-specific object identifiers (OIDs):
 ## PKCS#12 Bag Types: presence of OIDs such as 1.2.840.113549.1.12.10.1.x, 
which indicate different types of key and certificate bags (KeyBags, CertBags, 
etc.).
 ## PKCS#12 PbeIds: Encryption and hashing OIDs such as 
1.2.840.113549.1.12.1.x, which indicate the use of specific encryption 
mechanisms
 # Use of encryption schemes:
 ## Recognize encryption schemes, especially those that are typical for 
PKCS#12, such as pbeWithSHAAnd3-KeyTripleDES-CBC and pbeWithSHAAnd40BitRC2-CBC. 
These schemes are crucial for the security of PKCS#12 files and a clear 
indication of their presence.
 #  Structure of the file:
 ## Analyzing the file structure for multi-level nested SEQUENCE and 
OCTET_STRING elements, which are typically used to store encrypted private keys 
and certificates. The complexity of this structure is characteristic of PKCS#12 
files.
 # Specific attributes:
 ## PKCS#9 attributes such as friendlyName (OID 1.2.840.113549.1.9.20) and 
localKeyID (OID 1.2.840.113549.1.9.21) are commonly used to provide metadata 
for keys and certificates within the container.
 #  ... and more

 

However, since we are already talking about libraries: standard Java crypto and 
BouncyCastle already have all of this built in. They parse the structures, 
analyze them and use them, so using one of these two would be the easiest 
solution imho. I have never written a Detector, so please excuse my ignorance:
{code:java}
import java.io.InputStream;
import java.security.KeyStore;

import org.apache.tika.detect.Detector;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.mime.MediaType;

public class PKCS12Detector implements Detector {

    private static final long serialVersionUID = -8414458255467101503L;
    private static final MediaType PKCS12_MEDIA_TYPE = MediaType.application("x-pkcs12");

    @Override
    public MediaType detect(InputStream input, Metadata metadata) {
        try {
            KeyStore keyStore = KeyStore.getInstance("PKCS12");
            // No password is given, so integrity checking is not performed.
            // Note: load() consumes the stream; it is not rewound here.
            keyStore.load(input, null);
            return PKCS12_MEDIA_TYPE; // success
        } catch (Exception e) {
            return MediaType.OCTET_STREAM; // something else
        }
    }
}
{code}
A Bouncy Castle version would look quite similar...



[jira] [Updated] (TIKA-4194) tika fails to detect certain pkcs12 keystores types p12 pfx

2024-02-12 Thread Lonzak (Jira)


 [ 
https://issues.apache.org/jira/browse/TIKA-4194?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lonzak updated TIKA-4194:
-
Description: 
We use tika to detect the type of a file which is uploaded. In most cases this 
works quite well. However recently some files were rejected because tika 
reports an invalid file type. We'll get
{code:java}
APPLICATION/OCTET-STREAM{code}
instead of
{code:java}
APPLICATION/X-X509-KEY{code}
(As pointed out in TIKA-3784 the mimetype should really be 
"application/x-pkcs12" but for us "application/x-x509-key" works for now)

 

I did an analysis and found that tika doesn't recognize certain types of pkcs12 
keystores. The test keystores can be found 
[here|https://github.com/redhat-qe-security/keyfile-corpus/tree/master].

I created a list to show which ones are affected. Out of 157 keystores, 132 are 
correctly detected and 25 are not.

 
||#||correct?||type||filename||
|1|OK|APPLICATION/X-X509-KEY; 
FORMAT=DER|dsa(1024,sha1),cert(PBES2(PBKDF2(salt(8),iter(2048),keyLen(default),prf(default)),aes-128-cbc(IV(16,mac(sha1,salt(8),iter(2048)),pass(ascii).p12|
|2|OK|APPLICATION/X-X509-KEY; 
FORMAT=DER|dsa(1024,sha1),cert(pbeWithSHAAnd40BitRC2-CBC,salt(8),iter(2048)),key(pbeWithSHAAnd3-KeyTripleDES-CBC,salt(8),iter(2048)),mac(sha1,salt(8),iter(2048)),pass(ascii).p12|
|3|OK|APPLICATION/X-X509-KEY; 
FORMAT=DER|ecdsa(P-256,sha256),cert(PBES2(PBKDF2(salt(8),iter(2048),keyLen(default),prf(default)),aes-128-cbc(IV(16,mac(sha1,salt(8),iter(2048)),pass(ascii).p12|
|4|OK|APPLICATION/X-X509-KEY; 
FORMAT=DER|ecdsa(P-256,sha256),cert(none),key(none),mac(sha1,salt(8),iter(2048)),pass(ascii).p12|
|5|OK|APPLICATION/X-X509-KEY; 
FORMAT=DER|ecdsa(P-256,sha256),cert(none),key(none).p12|
|6|OK|APPLICATION/X-X509-KEY; 
FORMAT=DER|ecdsa(P-256,sha256),cert(pbeWithSHAAnd40BitRC2-CBC,salt(8),iter(2048)),key(pbeWithSHAAnd3-KeyTripleDES-CBC,salt(8),iter(2048)),mac(sha1,salt(8),iter(2048)),pass(ascii).p12|
|7|OK|APPLICATION/X-X509-KEY; 
FORMAT=DER|rsa(2048,sha256),cert(PBES2(PBKDF2(salt(0),iter(2048),keyLen(default),prf(default)),aes-128-cbc(IV(16,mac(sha1,salt(8),iter(2048)),pass(ascii).p12|
|8|OK|APPLICATION/X-X509-KEY; 
FORMAT=DER|rsa(2048,sha256),cert(PBES2(PBKDF2(salt(0),iter(2048),keyLen(default),prf(default)),des-ede3-cbc(IV(8,mac(sha1,salt(8),iter(2048)),pass(ascii).p12|
|9|OK|APPLICATION/X-X509-KEY; 
FORMAT=DER|rsa(2048,sha256),cert(PBES2(PBKDF2(salt(16),iter(2048),keyLen(default),prf(default)),aes-128-cbc(IV(16,mac(sha1,salt(8),iter(2048)),pass(ascii).p12|
|10|OK|APPLICATION/X-X509-KEY; 
FORMAT=DER|rsa(2048,sha256),cert(PBES2(PBKDF2(salt(16),iter(2048),keyLen(default),prf(default)),des-ede3-cbc(IV(8,mac(sha1,salt(8),iter(2048)),pass(ascii).p12|
|11|OK|APPLICATION/X-X509-KEY; 
FORMAT=DER|rsa(2048,sha256),cert(PBES2(PBKDF2(salt(64),iter(100),keyLen(default),prf(hmacWithSHA512)),aes-256-cbc(IV(16,mac(sha512,salt(64),iter(100)),pass(ascii).p12|
|12|OK|APPLICATION/X-X509-KEY; 
FORMAT=DER|rsa(2048,sha256),cert(PBES2(PBKDF2(salt(8),iter(1),keyLen(default),prf(default)),aes-128-cbc(IV(16,mac(sha1,salt(8),iter(2048)),pass(ascii).p12|
|13|OK|APPLICATION/X-X509-KEY; 
FORMAT=DER|rsa(2048,sha256),cert(PBES2(PBKDF2(salt(8),iter(1),keyLen(default),prf(default)),des-ede3-cbc(IV(8,mac(sha1,salt(8),iter(2048)),pass(ascii).p12|
|14|OK|APPLICATION/X-X509-KEY; 
FORMAT=DER|rsa(2048,sha256),cert(PBES2(PBKDF2(salt(8),iter(100),keyLen(default),prf(default)),aes-128-cbc(IV(16,mac(sha1,salt(8),iter(2048)),pass(ascii).p12|
|15|OK|APPLICATION/X-X509-KEY; 
FORMAT=DER|rsa(2048,sha256),cert(PBES2(PBKDF2(salt(8),iter(100),keyLen(default),prf(default)),des-ede3-cbc(IV(8,mac(sha1,salt(8),iter(2048)),pass(ascii).p12|
|16|OK|APPLICATION/X-X509-KEY; 
FORMAT=DER|rsa(2048,sha256),cert(PBES2(PBKDF2(salt(8),iter(2048),keyLen(16),prf(default)),rc2-cbc(keyBits(56=128bit),IV(8,mac(sha1,salt(8),iter(2048)),pass(ascii).p12|
|17|OK|APPLICATION/X-X509-KEY; 
FORMAT=DER|rsa(2048,sha256),cert(PBES2(PBKDF2(salt(8),iter(2048),keyLen(16),prf(hmacWithSHA256)),rc2-cbc(keyBits(56=128bit),IV(8,mac(sha1,salt(8),iter(2048)),pass(ascii).p12|
|18|OK|APPLICATION/X-X509-KEY; 
FORMAT=DER|rsa(2048,sha256),cert(PBES2(PBKDF2(salt(8),iter(2048),keyLen(5),prf(default)),rc2-cbc(keyBits(160=40bit),IV(8,mac(sha1,salt(8),iter(2048)),pass(ascii).p12|
|19|OK|APPLICATION/X-X509-KEY; 
FORMAT=DER|rsa(2048,sha256),cert(PBES2(PBKDF2(salt(8),iter(2048),keyLen(5),prf(hmacWithSHA256)),rc2-cbc(keyBits(160=40bit),IV(8,mac(sha1,salt(8),iter(2048)),pass(ascii).p12|
|20|OK|APPLICATION/X-X509-KEY; 
FORMAT=DER|rsa(2048,sha256),cert(PBES2(PBKDF2(salt(8),iter(2048),keyLen(8),prf(default)),rc2-cbc(keyBits(120=64bit),IV(8,mac(sha1,salt(8),iter(2048)),pass(ascii).p12|
|21|OK|APPLICATION/X-X509-KEY; 

[jira] [Commented] (TIKA-3784) Detector returns "application/x-x509-key" when scanning a .p12 file

2024-02-12 Thread Lonzak (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3784?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17816811#comment-17816811
 ] 

Lonzak commented on TIKA-3784:
--

PKCS12 is not the easiest format :-|

The oid for pkcs12 starts with "1.2.840.113549.1.12"

I decoded one pkcs12 example (from redhat) and got the following:
{code:java}

 3
 
  1.2.840.113549.1.7.1
  
   
    
     
      
       1.2.840.113549.1.7.1
       
        
         
          
           
            1.2.840.113549.1.12.10.1.2
            
             
              
               1.2.840.113549.1.12.1.3
               
                0xC8CCE579B6DE5B393F7C4885714C04BA
                2000
               
              
              0x...(shortened for readability)
             
            
            
             
              1.2.840.113549.1.9.20
              
               0x00630061
              
             
             
              1.2.840.113549.1.9.21
              
               
0x0CDA92EB395D4697A9D178352AF6B2BF06947888
              
             
            
           
          
         
        
       
      
      
       1.2.840.113549.1.7.6
       
        
         
         
          1.2.840.113549.1.7.1
          
           1.2.840.113549.1.12.1.6
           
            0x7F432D60BCD2888476E6CB9CD2BC69F1
            2000
           
          
          
           0x...(shortened for readability)
           0x0E8E4C15DCB1D87F
          
         
        
       
      
     
    
   
  
 
 
  
   
    1.3.14.3.2.26
    
   
   0x6DFFA14B5A8A32A87DAD2CFCE1EAEBDAFB89C897
  
  0x826699C21B9A4C9E3E608D3C8FBD2310
  2000
 
 {code}
The following things point to the PKCS12 format:
 # Presence of PKCS#12-specific object identifiers (OIDs):
 ## PKCS#12 Bag Types: presence of OIDs such as 1.2.840.113549.1.12.10.1.x, 
which indicate different types of key and certificate bags (KeyBags, CertBags, 
etc.).
 ## PKCS#12 PbeIds: Encryption and hashing OIDs such as 
1.2.840.113549.1.12.1.x, which indicate the use of specific encryption 
mechanisms 
 # Use of encryption schemes:
 ## Recognize encryption schemes, especially those that are typical for 
PKCS#12, such as pbeWithSHAAnd3-KeyTripleDES-CBC and pbeWithSHAAnd40BitRC2-CBC. 
These schemes are crucial for the security of PKCS#12 files and a clear 
indication of their presence.
 #  Structure of the file:
 ## Analyzing the file structure for multi-level nested SEQUENCE and 
OCTET_STRING elements, which are typically used to store encrypted private keys 
and certificates. The complexity of this structure is characteristic of PKCS#12 
files.
 # Specific attributes:
 ## PKCS#9 attributes such as friendlyName (OID 1.2.840.113549.1.9.20) and 
localKeyID (OID 1.2.840.113549.1.9.21) are commonly used to provide metadata 
for keys and certificates within the container.
 #  ... and more

 

However, since we are already talking about libraries: standard Java crypto and 
BouncyCastle already have all of this built in. They parse the structures, 
analyze them and use them, so using one of these two would be the easiest 
solution imho. I have never written a Detector, so please excuse my ignorance:
{code:java}
import java.io.InputStream;
import java.security.KeyStore;

import org.apache.tika.detect.Detector;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.mime.MediaType;

public class PKCS12Detector implements Detector {

    private static final long serialVersionUID = -8414458255467101503L;
    private static final MediaType PKCS12_MEDIA_TYPE = MediaType.application("x-pkcs12");

    @Override
    public MediaType detect(InputStream input, Metadata metadata) {
        try {
            KeyStore keyStore = KeyStore.getInstance("PKCS12");
            // No password is given, so integrity checking is not performed.
            // Note: load() consumes the stream; it is not rewound here.
            keyStore.load(input, null);
            return PKCS12_MEDIA_TYPE; // success
        } catch (Exception e) {
            return MediaType.OCTET_STREAM; // something else
        }
    }
}
{code}
A Bouncy Castle version would look quite similar...


[jira] [Commented] (TIKA-4191) tika-core and other deps should be "provided" in non-app contexts

2024-02-12 Thread Hudson (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4191?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17816804#comment-17816804
 ] 

Hudson commented on TIKA-4191:
--

SUCCESS: Integrated in Jenkins build Tika » tika-main-jdk11 #1505 (See 
[https://ci-builds.apache.org/job/Tika/job/tika-main-jdk11/1505/])
TIKA-4191 -- reduce tika-core's scope to "provided" where possible (#1575) 
(github: 
[https://github.com/apache/tika/commit/fb6ba1a33a225d91de1e2d162317ae629ee8c3ab])
* (edit) tika-app/pom.xml
* (edit) tika-eval/tika-eval-app/pom.xml
* (edit) tika-server/tika-server-core/pom.xml
* (edit) tika-eval/tika-eval-core/pom.xml
* (edit) tika-fuzzing/pom.xml
* (edit) tika-translate/pom.xml
* (edit) tika-xmp/pom.xml
* (edit) tika-java7/pom.xml
* (edit) CHANGES.txt
* (edit) tika-batch/pom.xml


> tika-core and other deps should be "provided" in non-app contexts
> -
>
> Key: TIKA-4191
> URL: https://issues.apache.org/jira/browse/TIKA-4191
> Project: Tika
>  Issue Type: Improvement
>Reporter: Tim Allison
>Priority: Trivial
>






[jira] [Commented] (TIKA-3784) Detector returns "application/x-x509-key" when scanning a .p12 file

2024-02-12 Thread Nick Burch (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3784?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17816788#comment-17816788
 ] 

Nick Burch commented on TIKA-3784:
--

From [https://datatracker.ietf.org/doc/rfc7292/] it looks like PKCS12 is based 
on PKCS7, so that's expected. There are a few more types defined in 
[https://www.rfc-editor.org/rfc/rfc7292.html#appendix-D]; not sure if we can 
find any of those to match on?

Though [https://www.cs.auckland.ac.nz/~pgut001/pubs/pfx.html] does suggest 
this isn't an ideal format...



[jira] [Commented] (TIKA-4196) Add a BOM charset detector

2024-02-12 Thread Hudson (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4196?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17816779#comment-17816779
 ] 

Hudson commented on TIKA-4196:
--

SUCCESS: Integrated in Jenkins build Tika » tika-main-jdk11 #1504 (See 
[https://ci-builds.apache.org/job/Tika/job/tika-main-jdk11/1504/])
TIKA-4196 -- add a bom EncodingDetector (#1590) (github: 
[https://github.com/apache/tika/commit/7c758c31e6e3f52b4c5f8ad2ac8169dc0f8b310a])
* (add) 
tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-text-module/src/test/java/org/apache/tika/parser/txt/BOMDetectorTest.java
* (add) 
tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-text-module/src/main/java/org/apache/tika/parser/txt/BOMDetector.java


> Add a BOM charset detector
> --
>
> Key: TIKA-4196
> URL: https://issues.apache.org/jira/browse/TIKA-4196
> Project: Tika
>  Issue Type: New Feature
>Reporter: Tim Allison
>Priority: Trivial
>
> The ICU4j and the StandardHtmlEncodingDetector detectors include a bom 
> detector, but for some use cases it would be useful to factor that out and 
> allow users to configure bom detection on their own.
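
To illustrate what stand-alone BOM detection involves, here is a minimal sketch. It is not the `BOMDetector` added in the commit above; `BomSniffer` is a hypothetical name, and the sketch assumes the JDK provides the UTF-32LE and UTF-32BE charsets (common, but not guaranteed by the Java specification).

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

/** Hypothetical stand-alone sketch; not the BOMDetector added in TIKA-4196. */
final class BomSniffer {

    /** Returns the charset indicated by a leading byte order mark, or null if there is none. */
    static Charset detect(byte[] head) {
        if (startsWith(head, 0xEF, 0xBB, 0xBF)) {
            return StandardCharsets.UTF_8;
        }
        // UTF-32 is tested before UTF-16 because FF FE 00 00 also starts with FF FE.
        if (startsWith(head, 0xFF, 0xFE, 0x00, 0x00)) {
            return Charset.forName("UTF-32LE");
        }
        if (startsWith(head, 0x00, 0x00, 0xFE, 0xFF)) {
            return Charset.forName("UTF-32BE");
        }
        if (startsWith(head, 0xFF, 0xFE)) {
            return StandardCharsets.UTF_16LE;
        }
        if (startsWith(head, 0xFE, 0xFF)) {
            return StandardCharsets.UTF_16BE;
        }
        return null;
    }

    /** Compares the first bytes of data, taken as unsigned values, against prefix. */
    private static boolean startsWith(byte[] data, int... prefix) {
        if (data == null || data.length < prefix.length) {
            return false;
        }
        for (int i = 0; i < prefix.length; i++) {
            if ((data[i] & 0xFF) != prefix[i]) {
                return false;
            }
        }
        return true;
    }
}
```

Only the first four bytes of the input are needed, so a caller can read a small prefix and rewind the stream afterwards.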





[jira] [Commented] (TIKA-4194) tika fails to detect certain pkcs12 keystores types p12 pfx

2024-02-12 Thread Hudson (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4194?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17816781#comment-17816781
 ] 

Hudson commented on TIKA-4194:
--

SUCCESS: Integrated in Jenkins build Tika » tika-main-jdk11 #1504 (See 
[https://ci-builds.apache.org/job/Tika/job/tika-main-jdk11/1504/])
[TIKA-4194] Fix for unrecognized pkcs12 keystores (#1589) (github: 
[https://github.com/apache/tika/commit/c2acd713bb31b88419ebc70dd31c4bfb23bd390f])
* (edit) tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml


> tika fails to detect certain pkcs12 keystores types p12 pfx
> ---
>
> Key: TIKA-4194
> URL: https://issues.apache.org/jira/browse/TIKA-4194
> Project: Tika
>  Issue Type: Bug
>  Components: detector
>Affects Versions: 2.9.1
>Reporter: Lonzak
>Priority: Major
>
> We use tika to detect the type of a file which is uploaded. In most cases 
> this works quite well. However recently some files were rejected because tika 
> reports an invalid file type. We'll get
> {code:java}
> APPLICATION/OCTET-STREAM{code}
> instead of
> {code:java}
> APPLICATION/X-X509-KEY{code}
> I did an analysis and found that tika doesn't recognize certain types of 
> pkcs12 keystores. The test keystores can be found 
> [here|https://github.com/redhat-qe-security/keyfile-corpus/tree/master].
> I created a list to show which ones are affected.  Out of 157 keystores, 132 
> are correctly detected and 25 are not.
>  
> ||#||correct?||type||filename||
> |1|OK|APPLICATION/X-X509-KEY; 
> FORMAT=DER|dsa(1024,sha1),cert(PBES2(PBKDF2(salt(8),iter(2048),keyLen(default),prf(default)),aes-128-cbc(IV(16,mac(sha1,salt(8),iter(2048)),pass(ascii).p12|
> |2|OK|APPLICATION/X-X509-KEY; 
> FORMAT=DER|dsa(1024,sha1),cert(pbeWithSHAAnd40BitRC2-CBC,salt(8),iter(2048)),key(pbeWithSHAAnd3-KeyTripleDES-CBC,salt(8),iter(2048)),mac(sha1,salt(8),iter(2048)),pass(ascii).p12|
> |3|OK|APPLICATION/X-X509-KEY; 
> FORMAT=DER|ecdsa(P-256,sha256),cert(PBES2(PBKDF2(salt(8),iter(2048),keyLen(default),prf(default)),aes-128-cbc(IV(16,mac(sha1,salt(8),iter(2048)),pass(ascii).p12|
> |4|OK|APPLICATION/X-X509-KEY; 
> FORMAT=DER|ecdsa(P-256,sha256),cert(none),key(none),mac(sha1,salt(8),iter(2048)),pass(ascii).p12|
> |5|OK|APPLICATION/X-X509-KEY; 
> FORMAT=DER|ecdsa(P-256,sha256),cert(none),key(none).p12|
> |6|OK|APPLICATION/X-X509-KEY; 
> FORMAT=DER|ecdsa(P-256,sha256),cert(pbeWithSHAAnd40BitRC2-CBC,salt(8),iter(2048)),key(pbeWithSHAAnd3-KeyTripleDES-CBC,salt(8),iter(2048)),mac(sha1,salt(8),iter(2048)),pass(ascii).p12|
> |7|OK|APPLICATION/X-X509-KEY; 
> FORMAT=DER|rsa(2048,sha256),cert(PBES2(PBKDF2(salt(0),iter(2048),keyLen(default),prf(default)),aes-128-cbc(IV(16,mac(sha1,salt(8),iter(2048)),pass(ascii).p12|
> |8|OK|APPLICATION/X-X509-KEY; 
> FORMAT=DER|rsa(2048,sha256),cert(PBES2(PBKDF2(salt(0),iter(2048),keyLen(default),prf(default)),des-ede3-cbc(IV(8,mac(sha1,salt(8),iter(2048)),pass(ascii).p12|
> |9|OK|APPLICATION/X-X509-KEY; 
> FORMAT=DER|rsa(2048,sha256),cert(PBES2(PBKDF2(salt(16),iter(2048),keyLen(default),prf(default)),aes-128-cbc(IV(16,mac(sha1,salt(8),iter(2048)),pass(ascii).p12|
> |10|OK|APPLICATION/X-X509-KEY; 
> FORMAT=DER|rsa(2048,sha256),cert(PBES2(PBKDF2(salt(16),iter(2048),keyLen(default),prf(default)),des-ede3-cbc(IV(8,mac(sha1,salt(8),iter(2048)),pass(ascii).p12|
> |11|OK|APPLICATION/X-X509-KEY; 
> FORMAT=DER|rsa(2048,sha256),cert(PBES2(PBKDF2(salt(64),iter(100),keyLen(default),prf(hmacWithSHA512)),aes-256-cbc(IV(16,mac(sha512,salt(64),iter(100)),pass(ascii).p12|
> |12|OK|APPLICATION/X-X509-KEY; 
> FORMAT=DER|rsa(2048,sha256),cert(PBES2(PBKDF2(salt(8),iter(1),keyLen(default),prf(default)),aes-128-cbc(IV(16,mac(sha1,salt(8),iter(2048)),pass(ascii).p12|
> |13|OK|APPLICATION/X-X509-KEY; 
> FORMAT=DER|rsa(2048,sha256),cert(PBES2(PBKDF2(salt(8),iter(1),keyLen(default),prf(default)),des-ede3-cbc(IV(8,mac(sha1,salt(8),iter(2048)),pass(ascii).p12|
> |14|OK|APPLICATION/X-X509-KEY; 
> FORMAT=DER|rsa(2048,sha256),cert(PBES2(PBKDF2(salt(8),iter(100),keyLen(default),prf(default)),aes-128-cbc(IV(16,mac(sha1,salt(8),iter(2048)),pass(ascii).p12|
> |15|OK|APPLICATION/X-X509-KEY; 
> FORMAT=DER|rsa(2048,sha256),cert(PBES2(PBKDF2(salt(8),iter(100),keyLen(default),prf(default)),des-ede3-cbc(IV(8,mac(sha1,salt(8),iter(2048)),pass(ascii).p12|
> |16|OK|APPLICATION/X-X509-KEY; 
> FORMAT=DER|rsa(2048,sha256),cert(PBES2(PBKDF2(salt(8),iter(2048),keyLen(16),prf(default)),rc2-cbc(keyBits(56=128bit),IV(8,mac(sha1,salt(8),iter(2048)),pass(ascii).p12|
> |17|OK|APPLICATION/X-X509-KEY; 
> FORMAT=DER|rsa(2048,sha256),cert(PBES2(PBKDF2(salt(8),iter(2048),keyLen(16),prf(hmacWithSHA256)),rc2-cbc(keyBits(56=128bit),IV(8,mac(sha1,salt(8),iter(2048)),pass(ascii).p12|
> |18|OK|APPLICATION/X-X509-KEY; 
> 
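
As background for the report above: a PKCS#12 (PFX) file is an ASN.1 SEQUENCE whose first element is the INTEGER version 3, so a header check can be sketched as below. This is a rough heuristic under stated assumptions, not the magic rule in tika-mimetypes.xml and not a claim about why the 25 files were missed. The class and method names are hypothetical, and other ASN.1 structures may start with the same bytes.

```java
// Hypothetical helper for illustration only; not Tika's detection logic.
final class Pkcs12Sniffer {

    /** Returns true if the bytes start like SEQUENCE { INTEGER 3, ... }. */
    static boolean looksLikePfx(byte[] head) {
        // 0x30 is the ASN.1 tag for a constructed SEQUENCE.
        if (head.length < 2 || (head[0] & 0xFF) != 0x30) {
            return false;
        }
        int lenByte = head[1] & 0xFF;
        int pos;
        if (lenByte <= 0x80) {
            // Short-form length (< 0x80) or BER indefinite length (0x80):
            // the content starts right after this single length byte.
            pos = 2;
        } else {
            // Long-form length: the low 7 bits give the number of length bytes to skip.
            pos = 2 + (lenByte & 0x7F);
        }
        // The version field is encoded as 02 01 03 (INTEGER, length 1, value 3).
        return head.length >= pos + 3
                && head[pos] == 0x02
                && head[pos + 1] == 0x01
                && head[pos + 2] == 0x03;
    }
}
```

The check accepts both definite-length (DER) and indefinite-length (BER) outer sequences, since both forms can occur in practice.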

[jira] [Commented] (TIKA-4195) JSoupParser conceals null from the EncodingDetector

2024-02-12 Thread Hudson (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4195?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17816780#comment-17816780
 ] 

Hudson commented on TIKA-4195:
--

SUCCESS: Integrated in Jenkins build Tika » tika-main-jdk11 #1504 (See 
[https://ci-builds.apache.org/job/Tika/job/tika-main-jdk11/1504/])
TIKA-4195 -- jsoup parser shouldn't conceal backoff to default encoding (#1591) 
(github: 
[https://github.com/apache/tika/commit/455409bf80801152e7c855ddc994fedc32c4cfcf])
* (edit) 
tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-text-module/src/test/java/org/apache/tika/parser/txt/TXTParserTest.java
* (edit) 
tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-html-module/src/test/java/org/apache/tika/parser/html/HtmlParserTest.java
* (edit) tika-core/src/main/java/org/apache/tika/detect/AutoDetectReader.java
* (edit) 
tika-parsers/tika-parsers-standard/tika-parsers-standard-package/src/test/java/org/apache/tika/parser/RecursiveParserWrapperTest.java
* (edit) 
tika-core/src/main/java/org/apache/tika/metadata/TikaCoreProperties.java
* (edit) 
tika-core/src/main/java/org/apache/tika/detect/CompositeEncodingDetector.java


> JSoupParser conceals null from the EncodingDetector
> ---
>
> Key: TIKA-4195
> URL: https://issues.apache.org/jira/browse/TIKA-4195
> Project: Tika
>  Issue Type: Improvement
>Reporter: Tim Allison
>Priority: Minor
> Fix For: 3.0.0
>
>
> The JSoupParser runs encoding detection on the InputStream. If the result is 
> null, the parser applies the default charset -- US-ASCII. This behavior is 
> ok. 
> The problem is that there is no way to distinguish when a faulty encoding 
> detector alleges 'US-ASCII' and the default behavior of the JSoupParser. I 
> don't think the JSoupParser should report the fallback encoding as if it were 
> detected.
> I'm not sure how best to report this in the metadata, but we need to be able 
> to differentiate detection and fallback encoding.
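
One way to picture the distinction asked for above is to carry a flag alongside the charset. This is a minimal sketch under stated assumptions: EncodingResult and resolve are hypothetical names, and the sketch does not show how PR #1591 actually records the information in the metadata.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical value type for illustration only; not part of Tika's API.
final class EncodingResult {
    final Charset charset;
    final boolean detected; // false when the fallback charset was applied

    private EncodingResult(Charset charset, boolean detected) {
        this.charset = charset;
        this.detected = detected;
    }

    /**
     * Wraps a detector's answer. A null detectorResult means "nothing detected",
     * in which case the fallback is used and marked as such.
     */
    static EncodingResult resolve(Charset detectorResult, Charset fallback) {
        if (detectorResult != null) {
            return new EncodingResult(detectorResult, true);
        }
        return new EncodingResult(fallback, false);
    }
}
```

With this shape, EncodingResult.resolve(null, StandardCharsets.US_ASCII) and EncodingResult.resolve(StandardCharsets.US_ASCII, StandardCharsets.US_ASCII) yield the same charset but differ in the detected flag, which is the information the issue says is currently lost.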





[jira] [Commented] (TIKA-4191) tika-core and other deps should be "provided" in non-app contexts

2024-02-12 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4191?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17816715#comment-17816715
 ] 

ASF GitHub Bot commented on TIKA-4191:
--

tballison merged PR #1575:
URL: https://github.com/apache/tika/pull/1575




> tika-core and other deps should be "provided" in non-app contexts
> -
>
> Key: TIKA-4191
> URL: https://issues.apache.org/jira/browse/TIKA-4191
> Project: Tika
>  Issue Type: Improvement
>Reporter: Tim Allison
>Priority: Trivial
>






Re: [PR] TIKA-4191 -- reduce tika-core's scope to "provided" where possible [tika]

2024-02-12 Thread via GitHub


tballison merged PR #1575:
URL: https://github.com/apache/tika/pull/1575





[jira] [Resolved] (TIKA-4197) Downgrade jackrabbit in 2.x

2024-02-12 Thread Tim Allison (Jira)


 [ 
https://issues.apache.org/jira/browse/TIKA-4197?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison resolved TIKA-4197.
---
Fix Version/s: 2.9.2
   Resolution: Fixed

> Downgrade jackrabbit in 2.x
> ---
>
> Key: TIKA-4197
> URL: https://issues.apache.org/jira/browse/TIKA-4197
> Project: Tika
>  Issue Type: Bug
>Reporter: Tim Allison
>Priority: Major
> Fix For: 2.9.2
>
>
> Looks like the latest jackrabbit requires Java 11: 
> https://github.com/apache/tika/actions/runs/7875864667/job/21488695827





[jira] [Created] (TIKA-4197) Downgrade jackrabbit in 2.x

2024-02-12 Thread Tim Allison (Jira)
Tim Allison created TIKA-4197:
-

 Summary: Downgrade jackrabbit in 2.x
 Key: TIKA-4197
 URL: https://issues.apache.org/jira/browse/TIKA-4197
 Project: Tika
  Issue Type: Bug
Reporter: Tim Allison


Looks like the latest jackrabbit requires Java 11: 
https://github.com/apache/tika/actions/runs/7875864667/job/21488695827





[jira] [Updated] (TIKA-4195) JSoupParser conceals null from the EncodingDetector

2024-02-12 Thread Tim Allison (Jira)


 [ 
https://issues.apache.org/jira/browse/TIKA-4195?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison updated TIKA-4195:
--
Description: 
The JSoupParser runs encoding detection on the InputStream. If the result is 
null, the parser applies the default charset -- US-ASCII. This behavior is ok. 

The problem is that there is no way to distinguish when a faulty encoding 
detector alleges 'US-ASCII' and the default behavior of the JSoupParser. I 
don't think the JSoupParser should report the fallback encoding as if it were 
detected.

I'm not sure how best to report this in the metadata, but we need to be able to 
differentiate detection and fallback encoding.

  was:
The JSoupParser is runs encoding detection on the inputstream. If the result is 
null, the parser applies the default charset -- US-ASCII. This behavior is ok. 

The problem is that there is no way to distinguish when a faulty encoding 
detector alleges 'US-ASCII' and the default behavior of the JSoupParser. I 
don't think the JSoupParser should report the fallback encoding as if it were 
detected.

I'm not sure how best to report this in the metadata, but we need to be able to 
differentiate detection and fallback encoding.


> JSoupParser conceals null from the EncodingDetector
> ---
>
> Key: TIKA-4195
> URL: https://issues.apache.org/jira/browse/TIKA-4195
> Project: Tika
>  Issue Type: Improvement
>Reporter: Tim Allison
>Priority: Minor
> Fix For: 3.0.0
>
>
> The JSoupParser runs encoding detection on the InputStream. If the result is 
> null, the parser applies the default charset -- US-ASCII. This behavior is 
> ok. 
> The problem is that there is no way to distinguish when a faulty encoding 
> detector alleges 'US-ASCII' and the default behavior of the JSoupParser. I 
> don't think the JSoupParser should report the fallback encoding as if it were 
> detected.
> I'm not sure how best to report this in the metadata, but we need to be able 
> to differentiate detection and fallback encoding.





[jira] [Resolved] (TIKA-4195) JSoupParser conceals null from the EncodingDetector

2024-02-12 Thread Tim Allison (Jira)


 [ 
https://issues.apache.org/jira/browse/TIKA-4195?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison resolved TIKA-4195.
---
Fix Version/s: 3.0.0
   Resolution: Fixed

> JSoupParser conceals null from the EncodingDetector
> ---
>
> Key: TIKA-4195
> URL: https://issues.apache.org/jira/browse/TIKA-4195
> Project: Tika
>  Issue Type: Improvement
>Reporter: Tim Allison
>Priority: Minor
> Fix For: 3.0.0
>
>
> The JSoupParser runs encoding detection on the InputStream. If the result 
> is null, the parser applies the default charset -- US-ASCII. This behavior is 
> ok. 
> The problem is that there is no way to distinguish when a faulty encoding 
> detector alleges 'US-ASCII' and the default behavior of the JSoupParser. I 
> don't think the JSoupParser should report the fallback encoding as if it were 
> detected.
> I'm not sure how best to report this in the metadata, but we need to be able 
> to differentiate detection and fallback encoding.





Re: [PR] TIKA-4195 -- jsoup parser shouldn't conceal backoff to default encoding [tika]

2024-02-12 Thread via GitHub


tballison merged PR #1591:
URL: https://github.com/apache/tika/pull/1591





[jira] [Commented] (TIKA-4195) JSoupParser conceals null from the EncodingDetector

2024-02-12 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4195?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17816694#comment-17816694
 ] 

ASF GitHub Bot commented on TIKA-4195:
--

tballison merged PR #1591:
URL: https://github.com/apache/tika/pull/1591




> JSoupParser conceals null from the EncodingDetector
> ---
>
> Key: TIKA-4195
> URL: https://issues.apache.org/jira/browse/TIKA-4195
> Project: Tika
>  Issue Type: Improvement
>Reporter: Tim Allison
>Priority: Minor
>
> The JSoupParser runs encoding detection on the InputStream. If the result 
> is null, the parser applies the default charset -- US-ASCII. This behavior is 
> ok. 
> The problem is that there is no way to distinguish when a faulty encoding 
> detector alleges 'US-ASCII' and the default behavior of the JSoupParser. I 
> don't think the JSoupParser should report the fallback encoding as if it were 
> detected.
> I'm not sure how best to report this in the metadata, but we need to be able 
> to differentiate detection and fallback encoding.





[jira] [Commented] (TIKA-4194) tika fails to detect certain pkcs12 keystores types p12 pfx

2024-02-12 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4194?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17816689#comment-17816689
 ] 

Tim Allison commented on TIKA-4194:
---

Merged and cherry-picked into branch_2x.

[~tom_1st] if you do have time to look at TIKA-3784, I'd be interested if you 
see any value in parsing these files with bouncycastle as a detector.  Thank 
you!

> tika fails to detect certain pkcs12 keystores types p12 pfx
> ---
>
> Key: TIKA-4194
> URL: https://issues.apache.org/jira/browse/TIKA-4194
> Project: Tika
>  Issue Type: Bug
>  Components: detector
>Affects Versions: 2.9.1
>Reporter: Lonzak
>Priority: Major
>
> We use tika to detect the type of a file which is uploaded. In most cases 
> this works quite well. However recently some files were rejected because tika 
> reports an invalid file type. We'll get
> {code:java}
> APPLICATION/OCTET-STREAM{code}
> instead of
> {code:java}
> APPLICATION/X-X509-KEY{code}
> I did an analysis and found that tika doesn't recognize certain types of 
> pkcs12 keystores. The test keystores can be found 
> [here|https://github.com/redhat-qe-security/keyfile-corpus/tree/master].
> I created a list to show which ones are affected.  Out of 157 keystores, 132 
> are correctly detected and 25 are not.
>  
> ||#||correct?||type||filename||
> |1|OK|APPLICATION/X-X509-KEY; 
> FORMAT=DER|dsa(1024,sha1),cert(PBES2(PBKDF2(salt(8),iter(2048),keyLen(default),prf(default)),aes-128-cbc(IV(16,mac(sha1,salt(8),iter(2048)),pass(ascii).p12|
> |2|OK|APPLICATION/X-X509-KEY; 
> FORMAT=DER|dsa(1024,sha1),cert(pbeWithSHAAnd40BitRC2-CBC,salt(8),iter(2048)),key(pbeWithSHAAnd3-KeyTripleDES-CBC,salt(8),iter(2048)),mac(sha1,salt(8),iter(2048)),pass(ascii).p12|
> |3|OK|APPLICATION/X-X509-KEY; 
> FORMAT=DER|ecdsa(P-256,sha256),cert(PBES2(PBKDF2(salt(8),iter(2048),keyLen(default),prf(default)),aes-128-cbc(IV(16,mac(sha1,salt(8),iter(2048)),pass(ascii).p12|
> |4|OK|APPLICATION/X-X509-KEY; 
> FORMAT=DER|ecdsa(P-256,sha256),cert(none),key(none),mac(sha1,salt(8),iter(2048)),pass(ascii).p12|
> |5|OK|APPLICATION/X-X509-KEY; 
> FORMAT=DER|ecdsa(P-256,sha256),cert(none),key(none).p12|
> |6|OK|APPLICATION/X-X509-KEY; 
> FORMAT=DER|ecdsa(P-256,sha256),cert(pbeWithSHAAnd40BitRC2-CBC,salt(8),iter(2048)),key(pbeWithSHAAnd3-KeyTripleDES-CBC,salt(8),iter(2048)),mac(sha1,salt(8),iter(2048)),pass(ascii).p12|
> |7|OK|APPLICATION/X-X509-KEY; 
> FORMAT=DER|rsa(2048,sha256),cert(PBES2(PBKDF2(salt(0),iter(2048),keyLen(default),prf(default)),aes-128-cbc(IV(16,mac(sha1,salt(8),iter(2048)),pass(ascii).p12|
> |8|OK|APPLICATION/X-X509-KEY; 
> FORMAT=DER|rsa(2048,sha256),cert(PBES2(PBKDF2(salt(0),iter(2048),keyLen(default),prf(default)),des-ede3-cbc(IV(8,mac(sha1,salt(8),iter(2048)),pass(ascii).p12|
> |9|OK|APPLICATION/X-X509-KEY; 
> FORMAT=DER|rsa(2048,sha256),cert(PBES2(PBKDF2(salt(16),iter(2048),keyLen(default),prf(default)),aes-128-cbc(IV(16,mac(sha1,salt(8),iter(2048)),pass(ascii).p12|
> |10|OK|APPLICATION/X-X509-KEY; 
> FORMAT=DER|rsa(2048,sha256),cert(PBES2(PBKDF2(salt(16),iter(2048),keyLen(default),prf(default)),des-ede3-cbc(IV(8,mac(sha1,salt(8),iter(2048)),pass(ascii).p12|
> |11|OK|APPLICATION/X-X509-KEY; 
> FORMAT=DER|rsa(2048,sha256),cert(PBES2(PBKDF2(salt(64),iter(100),keyLen(default),prf(hmacWithSHA512)),aes-256-cbc(IV(16,mac(sha512,salt(64),iter(100)),pass(ascii).p12|
> |12|OK|APPLICATION/X-X509-KEY; 
> FORMAT=DER|rsa(2048,sha256),cert(PBES2(PBKDF2(salt(8),iter(1),keyLen(default),prf(default)),aes-128-cbc(IV(16,mac(sha1,salt(8),iter(2048)),pass(ascii).p12|
> |13|OK|APPLICATION/X-X509-KEY; 
> FORMAT=DER|rsa(2048,sha256),cert(PBES2(PBKDF2(salt(8),iter(1),keyLen(default),prf(default)),des-ede3-cbc(IV(8,mac(sha1,salt(8),iter(2048)),pass(ascii).p12|
> |14|OK|APPLICATION/X-X509-KEY; 
> FORMAT=DER|rsa(2048,sha256),cert(PBES2(PBKDF2(salt(8),iter(100),keyLen(default),prf(default)),aes-128-cbc(IV(16,mac(sha1,salt(8),iter(2048)),pass(ascii).p12|
> |15|OK|APPLICATION/X-X509-KEY; 
> FORMAT=DER|rsa(2048,sha256),cert(PBES2(PBKDF2(salt(8),iter(100),keyLen(default),prf(default)),des-ede3-cbc(IV(8,mac(sha1,salt(8),iter(2048)),pass(ascii).p12|
> |16|OK|APPLICATION/X-X509-KEY; 
> FORMAT=DER|rsa(2048,sha256),cert(PBES2(PBKDF2(salt(8),iter(2048),keyLen(16),prf(default)),rc2-cbc(keyBits(56=128bit),IV(8,mac(sha1,salt(8),iter(2048)),pass(ascii).p12|
> |17|OK|APPLICATION/X-X509-KEY; 
> FORMAT=DER|rsa(2048,sha256),cert(PBES2(PBKDF2(salt(8),iter(2048),keyLen(16),prf(hmacWithSHA256)),rc2-cbc(keyBits(56=128bit),IV(8,mac(sha1,salt(8),iter(2048)),pass(ascii).p12|
> |18|OK|APPLICATION/X-X509-KEY; 
> 

[jira] [Commented] (TIKA-4194) tika fails to detect certain pkcs12 keystores types p12 pfx

2024-02-12 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4194?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17816687#comment-17816687
 ] 

ASF GitHub Bot commented on TIKA-4194:
--

tballison merged PR #1589:
URL: https://github.com/apache/tika/pull/1589




> tika fails to detect certain pkcs12 keystores types p12 pfx
> ---
>
> Key: TIKA-4194
> URL: https://issues.apache.org/jira/browse/TIKA-4194
> Project: Tika
>  Issue Type: Bug
>  Components: detector
>Affects Versions: 2.9.1
>Reporter: Lonzak
>Priority: Major
>
> We use tika to detect the type of a file which is uploaded. In most cases 
> this works quite well. However recently some files were rejected because tika 
> reports an invalid file type. We'll get
> {code:java}
> APPLICATION/OCTET-STREAM{code}
> instead of
> {code:java}
> APPLICATION/X-X509-KEY{code}
> I did an analysis and found that tika doesn't recognize certain types of 
> pkcs12 keystores. The test keystores can be found 
> [here|https://github.com/redhat-qe-security/keyfile-corpus/tree/master].
> I created a list to show which ones are affected.  Out of 157 keystores, 132 
> are correctly detected and 25 are not.
>  
> ||#||correct?||type||filename||
> |1|OK|APPLICATION/X-X509-KEY; 
> FORMAT=DER|dsa(1024,sha1),cert(PBES2(PBKDF2(salt(8),iter(2048),keyLen(default),prf(default)),aes-128-cbc(IV(16,mac(sha1,salt(8),iter(2048)),pass(ascii).p12|
> |2|OK|APPLICATION/X-X509-KEY; 
> FORMAT=DER|dsa(1024,sha1),cert(pbeWithSHAAnd40BitRC2-CBC,salt(8),iter(2048)),key(pbeWithSHAAnd3-KeyTripleDES-CBC,salt(8),iter(2048)),mac(sha1,salt(8),iter(2048)),pass(ascii).p12|
> |3|OK|APPLICATION/X-X509-KEY; 
> FORMAT=DER|ecdsa(P-256,sha256),cert(PBES2(PBKDF2(salt(8),iter(2048),keyLen(default),prf(default)),aes-128-cbc(IV(16,mac(sha1,salt(8),iter(2048)),pass(ascii).p12|
> |4|OK|APPLICATION/X-X509-KEY; 
> FORMAT=DER|ecdsa(P-256,sha256),cert(none),key(none),mac(sha1,salt(8),iter(2048)),pass(ascii).p12|
> |5|OK|APPLICATION/X-X509-KEY; 
> FORMAT=DER|ecdsa(P-256,sha256),cert(none),key(none).p12|
> |6|OK|APPLICATION/X-X509-KEY; 
> FORMAT=DER|ecdsa(P-256,sha256),cert(pbeWithSHAAnd40BitRC2-CBC,salt(8),iter(2048)),key(pbeWithSHAAnd3-KeyTripleDES-CBC,salt(8),iter(2048)),mac(sha1,salt(8),iter(2048)),pass(ascii).p12|
> |7|OK|APPLICATION/X-X509-KEY; 
> FORMAT=DER|rsa(2048,sha256),cert(PBES2(PBKDF2(salt(0),iter(2048),keyLen(default),prf(default)),aes-128-cbc(IV(16,mac(sha1,salt(8),iter(2048)),pass(ascii).p12|
> |8|OK|APPLICATION/X-X509-KEY; 
> FORMAT=DER|rsa(2048,sha256),cert(PBES2(PBKDF2(salt(0),iter(2048),keyLen(default),prf(default)),des-ede3-cbc(IV(8,mac(sha1,salt(8),iter(2048)),pass(ascii).p12|
> |9|OK|APPLICATION/X-X509-KEY; 
> FORMAT=DER|rsa(2048,sha256),cert(PBES2(PBKDF2(salt(16),iter(2048),keyLen(default),prf(default)),aes-128-cbc(IV(16,mac(sha1,salt(8),iter(2048)),pass(ascii).p12|
> |10|OK|APPLICATION/X-X509-KEY; 
> FORMAT=DER|rsa(2048,sha256),cert(PBES2(PBKDF2(salt(16),iter(2048),keyLen(default),prf(default)),des-ede3-cbc(IV(8,mac(sha1,salt(8),iter(2048)),pass(ascii).p12|
> |11|OK|APPLICATION/X-X509-KEY; 
> FORMAT=DER|rsa(2048,sha256),cert(PBES2(PBKDF2(salt(64),iter(100),keyLen(default),prf(hmacWithSHA512)),aes-256-cbc(IV(16,mac(sha512,salt(64),iter(100)),pass(ascii).p12|
> |12|OK|APPLICATION/X-X509-KEY; 
> FORMAT=DER|rsa(2048,sha256),cert(PBES2(PBKDF2(salt(8),iter(1),keyLen(default),prf(default)),aes-128-cbc(IV(16,mac(sha1,salt(8),iter(2048)),pass(ascii).p12|
> |13|OK|APPLICATION/X-X509-KEY; 
> FORMAT=DER|rsa(2048,sha256),cert(PBES2(PBKDF2(salt(8),iter(1),keyLen(default),prf(default)),des-ede3-cbc(IV(8,mac(sha1,salt(8),iter(2048)),pass(ascii).p12|
> |14|OK|APPLICATION/X-X509-KEY; 
> FORMAT=DER|rsa(2048,sha256),cert(PBES2(PBKDF2(salt(8),iter(100),keyLen(default),prf(default)),aes-128-cbc(IV(16,mac(sha1,salt(8),iter(2048)),pass(ascii).p12|
> |15|OK|APPLICATION/X-X509-KEY; 
> FORMAT=DER|rsa(2048,sha256),cert(PBES2(PBKDF2(salt(8),iter(100),keyLen(default),prf(default)),des-ede3-cbc(IV(8,mac(sha1,salt(8),iter(2048)),pass(ascii).p12|
> |16|OK|APPLICATION/X-X509-KEY; 
> FORMAT=DER|rsa(2048,sha256),cert(PBES2(PBKDF2(salt(8),iter(2048),keyLen(16),prf(default)),rc2-cbc(keyBits(56=128bit),IV(8,mac(sha1,salt(8),iter(2048)),pass(ascii).p12|
> |17|OK|APPLICATION/X-X509-KEY; 
> FORMAT=DER|rsa(2048,sha256),cert(PBES2(PBKDF2(salt(8),iter(2048),keyLen(16),prf(hmacWithSHA256)),rc2-cbc(keyBits(56=128bit),IV(8,mac(sha1,salt(8),iter(2048)),pass(ascii).p12|
> |18|OK|APPLICATION/X-X509-KEY; 
> FORMAT=DER|rsa(2048,sha256),cert(PBES2(PBKDF2(salt(8),iter(2048),keyLen(5),prf(default)),rc2-cbc(keyBits(160=40bit),IV(8,mac(sha1,salt(8),iter(2048)),pass(ascii).p12|
> |19|OK|APPLICATION/X-X509-KEY; 
> 

Re: [PR] [TIKA-4194] Fix for unrecognized pkcs12 keystores [tika]

2024-02-12 Thread via GitHub


tballison merged PR #1589:
URL: https://github.com/apache/tika/pull/1589





[jira] [Commented] (TIKA-4196) Add a BOM charset detector

2024-02-12 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4196?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17816680#comment-17816680
 ] 

ASF GitHub Bot commented on TIKA-4196:
--

tballison merged PR #1590:
URL: https://github.com/apache/tika/pull/1590




> Add a BOM charset detector
> --
>
> Key: TIKA-4196
> URL: https://issues.apache.org/jira/browse/TIKA-4196
> Project: Tika
>  Issue Type: New Feature
>Reporter: Tim Allison
>Priority: Trivial
>
> The ICU4j and the StandardHtmlEncodingDetector detectors include a bom 
> detector, but for some use cases it would be useful to factor that out and 
> allow users to configure bom detection on their own.





Re: [PR] TIKA-4196 [tika]

2024-02-12 Thread via GitHub


tballison merged PR #1590:
URL: https://github.com/apache/tika/pull/1590





[jira] [Commented] (TIKA-4195) JSoupParser conceals null from the EncodingDetector

2024-02-12 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4195?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17816679#comment-17816679
 ] 

ASF GitHub Bot commented on TIKA-4195:
--

tballison opened a new pull request, #1591:
URL: https://github.com/apache/tika/pull/1591

   
   
   Thanks for your contribution to [Apache Tika](https://tika.apache.org/)! 
Your help is appreciated!
   
   Before opening the pull request, please verify that
   * there is an open issue on the [Tika issue 
tracker](https://issues.apache.org/jira/projects/TIKA) which describes the 
problem or the improvement. We cannot accept pull requests without an issue 
because the change wouldn't be listed in the release notes.
   * the issue ID (`TIKA-`)
     - is referenced in the title of the pull request
     - and is placed in front of your commit messages, surrounded by square 
brackets (`[TIKA-] Issue or pull request title`)
   * commits are squashed into a single one (or a few commits for larger changes)
   * Tika builds successfully and unit tests pass when running `mvn clean test`
   * there are no conflicts when merging the pull request branch into the 
*recent* `main` branch. If there are conflicts, please try to rebase the pull 
request branch on top of a freshly pulled `main` branch
   * if you add a new module that downstream users will depend upon, add it to 
the relevant group in `tika-bom/pom.xml`.
   
   We will be able to integrate your pull request faster if these conditions 
are met. If you have any questions about how to fix your problem or about using 
Tika in general, please sign up for the [Tika mailing 
list](http://tika.apache.org/mail-lists.html). Thanks!
   




> JSoupParser conceals null from the EncodingDetector
> ---
>
> Key: TIKA-4195
> URL: https://issues.apache.org/jira/browse/TIKA-4195
> Project: Tika
>  Issue Type: Improvement
>Reporter: Tim Allison
>Priority: Minor
>
> The JSoupParser runs encoding detection on the InputStream. If the result 
> is null, the parser applies the default charset -- US-ASCII. This behavior is 
> ok. 
> The problem is that there is no way to distinguish when a faulty encoding 
> detector alleges 'US-ASCII' and the default behavior of the JSoupParser. I 
> don't think the JSoupParser should report the fallback encoding as if it were 
> detected.
> I'm not sure how best to report this in the metadata, but we need to be able 
> to differentiate detection and fallback encoding.





[PR] TIKA-4195 -- jsoup parser shouldn't conceal backoff to default encoding [tika]

2024-02-12 Thread via GitHub


tballison opened a new pull request, #1591:
URL: https://github.com/apache/tika/pull/1591

   
   
   





[jira] [Updated] (TIKA-4196) Add a BOM charset detector

2024-02-12 Thread Tim Allison (Jira)


 [ 
https://issues.apache.org/jira/browse/TIKA-4196?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison updated TIKA-4196:
--
Description: The ICU4j and the StandardHtmlEncodingDetector detectors 
include a bom detector, but for some use cases it would be useful to factor 
that out and allow users to configure bom detection on their own.  (was: The 
ICU4j detector uses a bom detector, but for some use cases it would be useful 
to factor that out and allow users to configure bom detection on their own.)

> Add a BOM charset detector
> --
>
> Key: TIKA-4196
> URL: https://issues.apache.org/jira/browse/TIKA-4196
> Project: Tika
>  Issue Type: New Feature
>Reporter: Tim Allison
>Priority: Trivial
>
> The ICU4j and the StandardHtmlEncodingDetector detectors include a bom 
> detector, but for some use cases it would be useful to factor that out and 
> allow users to configure bom detection on their own.





[jira] [Commented] (TIKA-4196) Add a BOM charset detector

2024-02-12 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4196?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17816668#comment-17816668
 ] 

ASF GitHub Bot commented on TIKA-4196:
--

tballison opened a new pull request, #1590:
URL: https://github.com/apache/tika/pull/1590

   
   
   Thanks for your contribution to [Apache Tika](https://tika.apache.org/)! 
Your help is appreciated!
   
   Before opening the pull request, please verify that
   * there is an open issue on the [Tika issue 
tracker](https://issues.apache.org/jira/projects/TIKA) which describes the 
problem or the improvement. We cannot accept pull requests without an issue 
because the change wouldn't be listed in the release notes.
   * the issue ID (`TIKA-`)
 - is referenced in the title of the pull request
 - and placed in front of your commit messages surrounded by square 
brackets (`[TIKA-] Issue or pull request title`)
   * commits are squashed into a single one (or few commits for larger changes)
   * Tika is successfully built and unit tests pass by running `mvn clean test`
   * there should be no conflicts when merging the pull request branch into the 
*recent* `main` branch. If there are conflicts, please try to rebase the pull 
request branch on top of a freshly pulled `main` branch
   * if you add a new module that downstream users will depend upon, add it to 
the relevant group in `tika-bom/pom.xml`.
   
   We will be able to integrate your pull request faster if these conditions 
are met. If you have any questions about how to fix your problem or about using 
Tika in general, please sign up for the [Tika mailing 
list](http://tika.apache.org/mail-lists.html). Thanks!
   




> Add a BOM charset detector
> --
>
> Key: TIKA-4196
> URL: https://issues.apache.org/jira/browse/TIKA-4196
> Project: Tika
>  Issue Type: New Feature
>Reporter: Tim Allison
>Priority: Trivial
>
> The ICU4j detector uses a bom detector, but for some use cases it would be 
> useful to factor that out and allow users to configure bom detection on their 
> own.





[PR] TIKA-4196 [tika]

2024-02-12 Thread via GitHub


tballison opened a new pull request, #1590:
URL: https://github.com/apache/tika/pull/1590






[jira] [Created] (TIKA-4196) Add a BOM charset detector

2024-02-12 Thread Tim Allison (Jira)
Tim Allison created TIKA-4196:
-

 Summary: Add a BOM charset detector
 Key: TIKA-4196
 URL: https://issues.apache.org/jira/browse/TIKA-4196
 Project: Tika
  Issue Type: New Feature
Reporter: Tim Allison


The ICU4j detector uses a bom detector, but for some use cases it would be 
useful to factor that out and allow users to configure bom detection on their 
own.
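
A minimal sketch of what a standalone BOM check could look like, assuming only the well-known byte order marks for UTF-8, UTF-16 and UTF-32. The class name `BomSniffer` and its method are hypothetical and are not Tika's API; a real detector would read the leading bytes from a marked stream and reset it afterwards.

```java
import java.util.Optional;

/** Hypothetical helper for illustration; not part of Tika. */
final class BomSniffer {

    private BomSniffer() {}

    /** Returns the charset name implied by a BOM at the start of head, or empty if there is none. */
    static Optional<String> detect(byte[] head) {
        // UTF-32 marks are checked first because FF FE 00 00 begins with the UTF-16LE mark FF FE.
        if (startsWith(head, 0x00, 0x00, 0xFE, 0xFF)) return Optional.of("UTF-32BE");
        if (startsWith(head, 0xFF, 0xFE, 0x00, 0x00)) return Optional.of("UTF-32LE");
        if (startsWith(head, 0xEF, 0xBB, 0xBF)) return Optional.of("UTF-8");
        if (startsWith(head, 0xFE, 0xFF)) return Optional.of("UTF-16BE");
        if (startsWith(head, 0xFF, 0xFE)) return Optional.of("UTF-16LE");
        return Optional.empty();
    }

    /** Compares the first bytes of data, read as unsigned values, against prefix. */
    private static boolean startsWith(byte[] data, int... prefix) {
        if (data.length < prefix.length) return false;
        for (int i = 0; i < prefix.length; i++) {
            if ((data[i] & 0xFF) != prefix[i]) return false;
        }
        return true;
    }

    public static void main(String[] args) {
        byte[] sample = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF, 'h', 'i'};
        System.out.println(detect(sample).orElse("no BOM"));
    }
}
```

Note that FF FE 00 00 is ambiguous: it could also be UTF-16LE text that starts with U+0000. The sketch resolves it in favour of UTF-32LE, which is one reason users might want to configure this step themselves.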





[jira] [Commented] (TIKA-4194) tika fails to detect certain pkcs12 keystores types p12 pfx

2024-02-12 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4194?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17816661#comment-17816661
 ] 

Tim Allison commented on TIKA-4194:
---

Thank you for this! I'll try to take a look later today.

Is there anything in 
https://issues.apache.org/jira/browse/TIKA-3784?focusedCommentId=17816191=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-17816191
 that could be useful? IIUC, magic is doubtful for this file type. On that 
comment I ran the bouncycastle parser on the files and pulled out some info. 
Can we use that info for detection?

Again, thank you, and I'm not necessarily against a "better than what we have" 
magic guessing approach.

> tika fails to detect certain pkcs12 keystores types p12 pfx
> ---
>
> Key: TIKA-4194
> URL: https://issues.apache.org/jira/browse/TIKA-4194
> Project: Tika
>  Issue Type: Bug
>  Components: detector
>Affects Versions: 2.9.1
>Reporter: Lonzak
>Priority: Major
>
> We use tika to detect the type of a file which is uploaded. In most cases 
> this works quite well. However recently some files were rejected because tika 
> reports an invalid file type. We'll get
> {code:java}
> APPLICATION/OCTET-STREAM{code}
> instead of
> {code:java}
> APPLICATION/X-X509-KEY{code}
> I did an analysis and found that tika doesn't recognize certain types of 
> pkcs12 keystores. The test keystores can be found 
> [here|https://github.com/redhat-qe-security/keyfile-corpus/tree/master].
> I created a list to show which ones are affected.  Out of 157 keystores 132 
> are correctly detected and 25 are not.
>  
> ||#||correct?||type||filename||
> |1|OK|APPLICATION/X-X509-KEY; 
> FORMAT=DER|dsa(1024,sha1),cert(PBES2(PBKDF2(salt(8),iter(2048),keyLen(default),prf(default)),aes-128-cbc(IV(16,mac(sha1,salt(8),iter(2048)),pass(ascii).p12|
> |2|OK|APPLICATION/X-X509-KEY; 
> FORMAT=DER|dsa(1024,sha1),cert(pbeWithSHAAnd40BitRC2-CBC,salt(8),iter(2048)),key(pbeWithSHAAnd3-KeyTripleDES-CBC,salt(8),iter(2048)),mac(sha1,salt(8),iter(2048)),pass(ascii).p12|
> |3|OK|APPLICATION/X-X509-KEY; 
> FORMAT=DER|ecdsa(P-256,sha256),cert(PBES2(PBKDF2(salt(8),iter(2048),keyLen(default),prf(default)),aes-128-cbc(IV(16,mac(sha1,salt(8),iter(2048)),pass(ascii).p12|
> |4|OK|APPLICATION/X-X509-KEY; 
> FORMAT=DER|ecdsa(P-256,sha256),cert(none),key(none),mac(sha1,salt(8),iter(2048)),pass(ascii).p12|
> |5|OK|APPLICATION/X-X509-KEY; 
> FORMAT=DER|ecdsa(P-256,sha256),cert(none),key(none).p12|
> |6|OK|APPLICATION/X-X509-KEY; 
> FORMAT=DER|ecdsa(P-256,sha256),cert(pbeWithSHAAnd40BitRC2-CBC,salt(8),iter(2048)),key(pbeWithSHAAnd3-KeyTripleDES-CBC,salt(8),iter(2048)),mac(sha1,salt(8),iter(2048)),pass(ascii).p12|
> |7|OK|APPLICATION/X-X509-KEY; 
> FORMAT=DER|rsa(2048,sha256),cert(PBES2(PBKDF2(salt(0),iter(2048),keyLen(default),prf(default)),aes-128-cbc(IV(16,mac(sha1,salt(8),iter(2048)),pass(ascii).p12|
> |8|OK|APPLICATION/X-X509-KEY; 
> FORMAT=DER|rsa(2048,sha256),cert(PBES2(PBKDF2(salt(0),iter(2048),keyLen(default),prf(default)),des-ede3-cbc(IV(8,mac(sha1,salt(8),iter(2048)),pass(ascii).p12|
> |9|OK|APPLICATION/X-X509-KEY; 
> FORMAT=DER|rsa(2048,sha256),cert(PBES2(PBKDF2(salt(16),iter(2048),keyLen(default),prf(default)),aes-128-cbc(IV(16,mac(sha1,salt(8),iter(2048)),pass(ascii).p12|
> |10|OK|APPLICATION/X-X509-KEY; 
> FORMAT=DER|rsa(2048,sha256),cert(PBES2(PBKDF2(salt(16),iter(2048),keyLen(default),prf(default)),des-ede3-cbc(IV(8,mac(sha1,salt(8),iter(2048)),pass(ascii).p12|
> |11|OK|APPLICATION/X-X509-KEY; 
> FORMAT=DER|rsa(2048,sha256),cert(PBES2(PBKDF2(salt(64),iter(100),keyLen(default),prf(hmacWithSHA512)),aes-256-cbc(IV(16,mac(sha512,salt(64),iter(100)),pass(ascii).p12|
> |12|OK|APPLICATION/X-X509-KEY; 
> FORMAT=DER|rsa(2048,sha256),cert(PBES2(PBKDF2(salt(8),iter(1),keyLen(default),prf(default)),aes-128-cbc(IV(16,mac(sha1,salt(8),iter(2048)),pass(ascii).p12|
> |13|OK|APPLICATION/X-X509-KEY; 
> FORMAT=DER|rsa(2048,sha256),cert(PBES2(PBKDF2(salt(8),iter(1),keyLen(default),prf(default)),des-ede3-cbc(IV(8,mac(sha1,salt(8),iter(2048)),pass(ascii).p12|
> |14|OK|APPLICATION/X-X509-KEY; 
> FORMAT=DER|rsa(2048,sha256),cert(PBES2(PBKDF2(salt(8),iter(100),keyLen(default),prf(default)),aes-128-cbc(IV(16,mac(sha1,salt(8),iter(2048)),pass(ascii).p12|
> |15|OK|APPLICATION/X-X509-KEY; 
> FORMAT=DER|rsa(2048,sha256),cert(PBES2(PBKDF2(salt(8),iter(100),keyLen(default),prf(default)),des-ede3-cbc(IV(8,mac(sha1,salt(8),iter(2048)),pass(ascii).p12|
> |16|OK|APPLICATION/X-X509-KEY; 
> FORMAT=DER|rsa(2048,sha256),cert(PBES2(PBKDF2(salt(8),iter(2048),keyLen(16),prf(default)),rc2-cbc(keyBits(56=128bit),IV(8,mac(sha1,salt(8),iter(2048)),pass(ascii).p12|
> |17|OK|APPLICATION/X-X509-KEY; 
> 

[jira] [Created] (TIKA-4195) JSoupParser conceals null from the EncodingDetector

2024-02-12 Thread Tim Allison (Jira)
Tim Allison created TIKA-4195:
-

 Summary: JSoupParser conceals null from the EncodingDetector
 Key: TIKA-4195
 URL: https://issues.apache.org/jira/browse/TIKA-4195
 Project: Tika
  Issue Type: Improvement
Reporter: Tim Allison


The JSoupParser runs encoding detection on the inputstream. If the result is 
null, the parser applies the default charset -- US-ASCII. This behavior is ok. 

The problem is that there is no way to distinguish when a faulty encoding 
detector alleges 'US-ASCII' and the default behavior of the JSoupParser. I 
don't think the JSoupParser should report the fallback encoding as if it were 
detected.

I'm not sure how best to report this in the metadata, but we need to be able to 
differentiate detection and fallback encoding.
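
One way to picture the distinction, as a sketch only: carry the origin of the charset alongside the charset itself. The type `EncodingChoice` and its `Source` enum are hypothetical names for illustration and do not exist in Tika; how this would actually be surfaced in the metadata is the open question above.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

/** Hypothetical type for illustration; not part of Tika. */
final class EncodingChoice {

    enum Source { DETECTED, FALLBACK }

    final Charset charset;
    final Source source;

    private EncodingChoice(Charset charset, Source source) {
        this.charset = charset;
        this.source = source;
    }

    /** Wraps a detector result; a null result selects the fallback and is recorded as such. */
    static EncodingChoice of(Charset detected, Charset fallback) {
        if (detected != null) {
            return new EncodingChoice(detected, Source.DETECTED);
        }
        return new EncodingChoice(fallback, Source.FALLBACK);
    }

    public static void main(String[] args) {
        // A detector that alleges US-ASCII and a detector that returns null
        // end up with the same charset but a different source.
        EncodingChoice alleged = of(StandardCharsets.US_ASCII, StandardCharsets.US_ASCII);
        EncodingChoice missing = of(null, StandardCharsets.US_ASCII);
        System.out.println(alleged.charset + " " + alleged.source);
        System.out.println(missing.charset + " " + missing.source);
    }
}
```

With something like this, the two cases in the description stay distinguishable even though both parse the stream as US-ASCII.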





[jira] [Comment Edited] (TIKA-4194) tika fails to detect certain pkcs12 keystores types p12 pfx

2024-02-12 Thread Lonzak (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4194?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17816607#comment-17816607
 ] 

Lonzak edited comment on TIKA-4194 at 2/12/24 1:47 PM:
---

Interestingly the "application/pkcs7-signature" type looks quite similar:
{code:java}

  
         
      
      
         
      
      
         
      
      
         
      
      
         
      
 
{code}
Just had to adapt the offset a bit and it worked:
{code:java}
  
         
      
      
         
      {code}
However I didn't find a keystore with 0x3081 so the offset is unclear in that 
case. My solution would look like this now and works for all the cases...
{code:java}
    
      
     ...
    {code}
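
For context on why the offset moves, here is a rough sketch of how the ASN.1 length encoding shifts the first content byte. This is not the magic definition from the pull request; `DerHeader` and its methods are hypothetical names. It assumes the file starts with a SEQUENCE tag (0x30) and that, as in the PKCS #12 PFX structure, the first field is the version INTEGER 3 (bytes 02 01 03).

```java
/** Hypothetical helper for illustration; not the magic definition from the pull request. */
final class DerHeader {

    private DerHeader() {}

    /**
     * Returns the offset of the first content byte of an outer SEQUENCE,
     * or -1 if the data does not start with tag 0x30 or the header is truncated.
     */
    static int contentOffset(byte[] data) {
        if (data.length < 2 || (data[0] & 0xFF) != 0x30) return -1;
        int first = data[1] & 0xFF;
        if (first <= 0x80) return 2;        // short form, or 0x80 = BER indefinite length
        int lengthBytes = first & 0x7F;     // long form: 0x81 -> 1 byte, 0x82 -> 2 bytes, ...
        if (lengthBytes > 4 || data.length < 2 + lengthBytes) return -1;
        return 2 + lengthBytes;
    }

    /** True if the outer SEQUENCE starts with INTEGER 3 (02 01 03), the PFX version field. */
    static boolean looksLikePfx(byte[] data) {
        int off = contentOffset(data);
        return off >= 0
                && data.length >= off + 3
                && data[off] == 0x02
                && data[off + 1] == 0x01
                && data[off + 2] == 0x03;
    }

    public static void main(String[] args) {
        byte[] with0x82 = {0x30, (byte) 0x82, 0x0A, 0x00, 0x02, 0x01, 0x03};
        byte[] with0x81 = {0x30, (byte) 0x81, (byte) 0x90, 0x02, 0x01, 0x03};
        System.out.println(contentOffset(with0x82) + " " + looksLikePfx(with0x82));
        System.out.println(contentOffset(with0x81) + " " + looksLikePfx(with0x81));
    }
}
```

Under these assumptions a 0x3081 header would put the version field at offset 3 instead of 4. The version check alone is a weak signal, since other ASN.1 structures can begin the same way.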



[jira] [Commented] (TIKA-4194) tika fails to detect certain pkcs12 keystores types p12 pfx

2024-02-12 Thread Lonzak (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4194?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17816607#comment-17816607
 ] 

Lonzak commented on TIKA-4194:
--

Interestingly the "application/pkcs7-signature" type looks quite similar:

 

 
{code:java}

  
         
      
      
         
      
      
         
      
      
         
      
      
         
      
 
{code}
 

Just had to adapt the offset a bit and it worked:

 
{code:java}
  
         
      
      
         
      {code}
 

 

However I didn't find a keystore with 0x3081 so the offset is unclear in that 
case. My solution would look like this now and works for all the cases...

 
{code:java}
    
      
     ...
    {code}
 

 

 

 


[jira] [Commented] (TIKA-4194) tika fails to detect certain pkcs12 keystores types p12 pfx

2024-02-12 Thread Lonzak (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4194?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17816608#comment-17816608
 ] 

Lonzak commented on TIKA-4194:
--

Added a pull request: https://github.com/apache/tika/pull/1589

> tika fails to detect certain pkcs12 keystores types p12 pfx
> ---
>
> Key: TIKA-4194
> URL: https://issues.apache.org/jira/browse/TIKA-4194
> Project: Tika
>  Issue Type: Bug
>  Components: detector
>Affects Versions: 2.9.1
>Reporter: Lonzak
>Priority: Major
>
> We use tika to detect the type of a file which is uploaded. In most cases 
> this works quite well. However recently some files were rejected because tika 
> reports an invalid file type. We'll get
> {code:java}
> APPLICATION/OCTET-STREAM{code}
> instead of
> {code:java}
> APPLICATION/X-X509-KEY{code}
> I did an analysis and found that tika doesn't recognize certain types of 
> pkcs12 keystores. The test keystores can be found 
> [here|https://github.com/redhat-qe-security/keyfile-corpus/tree/master].
> I created a list to show which ones are affected.  Out of 157 keystores 132 
> are correctly detected and 25 are not.
>  
> ||#||correct?||type||filename||
> |1|OK|APPLICATION/X-X509-KEY; 
> FORMAT=DER|dsa(1024,sha1),cert(PBES2(PBKDF2(salt(8),iter(2048),keyLen(default),prf(default)),aes-128-cbc(IV(16,mac(sha1,salt(8),iter(2048)),pass(ascii).p12|
> |2|OK|APPLICATION/X-X509-KEY; 
> FORMAT=DER|dsa(1024,sha1),cert(pbeWithSHAAnd40BitRC2-CBC,salt(8),iter(2048)),key(pbeWithSHAAnd3-KeyTripleDES-CBC,salt(8),iter(2048)),mac(sha1,salt(8),iter(2048)),pass(ascii).p12|
> |3|OK|APPLICATION/X-X509-KEY; 
> FORMAT=DER|ecdsa(P-256,sha256),cert(PBES2(PBKDF2(salt(8),iter(2048),keyLen(default),prf(default)),aes-128-cbc(IV(16,mac(sha1,salt(8),iter(2048)),pass(ascii).p12|
> |4|OK|APPLICATION/X-X509-KEY; 
> FORMAT=DER|ecdsa(P-256,sha256),cert(none),key(none),mac(sha1,salt(8),iter(2048)),pass(ascii).p12|
> |5|OK|APPLICATION/X-X509-KEY; 
> FORMAT=DER|ecdsa(P-256,sha256),cert(none),key(none).p12|
> |6|OK|APPLICATION/X-X509-KEY; 
> FORMAT=DER|ecdsa(P-256,sha256),cert(pbeWithSHAAnd40BitRC2-CBC,salt(8),iter(2048)),key(pbeWithSHAAnd3-KeyTripleDES-CBC,salt(8),iter(2048)),mac(sha1,salt(8),iter(2048)),pass(ascii).p12|
> |7|OK|APPLICATION/X-X509-KEY; 
> FORMAT=DER|rsa(2048,sha256),cert(PBES2(PBKDF2(salt(0),iter(2048),keyLen(default),prf(default)),aes-128-cbc(IV(16,mac(sha1,salt(8),iter(2048)),pass(ascii).p12|
> |8|OK|APPLICATION/X-X509-KEY; 
> FORMAT=DER|rsa(2048,sha256),cert(PBES2(PBKDF2(salt(0),iter(2048),keyLen(default),prf(default)),des-ede3-cbc(IV(8,mac(sha1,salt(8),iter(2048)),pass(ascii).p12|
> |9|OK|APPLICATION/X-X509-KEY; 
> FORMAT=DER|rsa(2048,sha256),cert(PBES2(PBKDF2(salt(16),iter(2048),keyLen(default),prf(default)),aes-128-cbc(IV(16,mac(sha1,salt(8),iter(2048)),pass(ascii).p12|
> |10|OK|APPLICATION/X-X509-KEY; 
> FORMAT=DER|rsa(2048,sha256),cert(PBES2(PBKDF2(salt(16),iter(2048),keyLen(default),prf(default)),des-ede3-cbc(IV(8,mac(sha1,salt(8),iter(2048)),pass(ascii).p12|
> |11|OK|APPLICATION/X-X509-KEY; 
> FORMAT=DER|rsa(2048,sha256),cert(PBES2(PBKDF2(salt(64),iter(100),keyLen(default),prf(hmacWithSHA512)),aes-256-cbc(IV(16,mac(sha512,salt(64),iter(100)),pass(ascii).p12|
> |12|OK|APPLICATION/X-X509-KEY; 
> FORMAT=DER|rsa(2048,sha256),cert(PBES2(PBKDF2(salt(8),iter(1),keyLen(default),prf(default)),aes-128-cbc(IV(16,mac(sha1,salt(8),iter(2048)),pass(ascii).p12|
> |13|OK|APPLICATION/X-X509-KEY; 
> FORMAT=DER|rsa(2048,sha256),cert(PBES2(PBKDF2(salt(8),iter(1),keyLen(default),prf(default)),des-ede3-cbc(IV(8,mac(sha1,salt(8),iter(2048)),pass(ascii).p12|
> |14|OK|APPLICATION/X-X509-KEY; 
> FORMAT=DER|rsa(2048,sha256),cert(PBES2(PBKDF2(salt(8),iter(100),keyLen(default),prf(default)),aes-128-cbc(IV(16,mac(sha1,salt(8),iter(2048)),pass(ascii).p12|
> |15|OK|APPLICATION/X-X509-KEY; 
> FORMAT=DER|rsa(2048,sha256),cert(PBES2(PBKDF2(salt(8),iter(100),keyLen(default),prf(default)),des-ede3-cbc(IV(8,mac(sha1,salt(8),iter(2048)),pass(ascii).p12|
> |16|OK|APPLICATION/X-X509-KEY; 
> FORMAT=DER|rsa(2048,sha256),cert(PBES2(PBKDF2(salt(8),iter(2048),keyLen(16),prf(default)),rc2-cbc(keyBits(56=128bit),IV(8,mac(sha1,salt(8),iter(2048)),pass(ascii).p12|
> |17|OK|APPLICATION/X-X509-KEY; 
> FORMAT=DER|rsa(2048,sha256),cert(PBES2(PBKDF2(salt(8),iter(2048),keyLen(16),prf(hmacWithSHA256)),rc2-cbc(keyBits(56=128bit),IV(8,mac(sha1,salt(8),iter(2048)),pass(ascii).p12|
> |18|OK|APPLICATION/X-X509-KEY; 
> FORMAT=DER|rsa(2048,sha256),cert(PBES2(PBKDF2(salt(8),iter(2048),keyLen(5),prf(default)),rc2-cbc(keyBits(160=40bit),IV(8,mac(sha1,salt(8),iter(2048)),pass(ascii).p12|
> |19|OK|APPLICATION/X-X509-KEY; 
> 

[jira] [Commented] (TIKA-4194) tika fails to detect certain pkcs12 keystores types p12 pfx

2024-02-12 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4194?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17816606#comment-17816606
 ] 

ASF GitHub Bot commented on TIKA-4194:
--

Lonzak commented on PR #1589:
URL: https://github.com/apache/tika/pull/1589#issuecomment-1938709322

   It would be appreciated if the change could go into 2.9.X 
([branch_2x](https://github.com/apache/tika/tree/branch_2x))





Re: [PR] [TIKA-4194] Fix for unrecognized pkcs12 keystores [tika]

2024-02-12 Thread via GitHub


Lonzak commented on PR #1589:
URL: https://github.com/apache/tika/pull/1589#issuecomment-1938709322

   It would be appreciated if the change could go into 2.9.X 
([branch_2x](https://github.com/apache/tika/tree/branch_2x))





[jira] [Commented] (TIKA-4194) tika fails to detect certain pkcs12 keystores types p12 pfx

2024-02-12 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4194?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17816605#comment-17816605
 ] 

ASF GitHub Bot commented on TIKA-4194:
--

Lonzak opened a new pull request, #1589:
URL: https://github.com/apache/tika/pull/1589

   Fixes the issue that some pkcs12 keystores are not correctly detected. 
Tested with 157x p12 keystores (kindly provided by 
[redhat](https://github.com/redhat-qe-security/keyfile-corpus/tree/master).)



