On May 25, 2023, at 21:13, Tom Hubbard <cubb...@gmail.com> wrote:
> An InChI could not be generated and used to canonise SMILES: null
>
> Could not generate InChI Numbers: Too many atoms [did you forget
> 'LargeMolecules' switch?]

CDK uses InChI to generate absolute SMILES. Here's a comment from the code:

     * Create a absolute SMILES generator. Unique SMILES uses the InChI to
     * canonise SMILES and encodes isotope or stereo-chemistry. The InChI
     * module is not a dependency of the SMILES module but should be present
     * on the classpath when generation absolute SMILES.

If you remove either the SmiFlavor.Canonical or the SmiFlavor.Isomeric bit flag from your output flavor then you'll get a SMILES, though it won't be an absolute SMILES.

More specifically, CDK uses InChI to generate the atom labels used during canonical SMILES generation. In cdk/smiles/SmilesGenerator.java there's a code path which looks like:

    // apply the canonical labelling
    if (SmiFlavor.isSet(flavour, SmiFlavor.Canonical)) {

        // determine the output order
        int[] labels = labels(flavour, molecule);

where labels() is:

    private static int[] labels(int flavour, final IAtomContainer molecule) throws CDKException {
        // FIXME: use SmiOpt.InChiLabelling
        long[] labels = SmiFlavor.isSet(flavour, SmiFlavor.Isomeric) ? inchiNumbers(molecule)
                : Canon.label(molecule, GraphUtil.toAdjList(molecule), createComparator(molecule, flavour));

Thus, if both SmiFlavor.Canonical and SmiFlavor.Isomeric are set, it ends up using code in cdk/graph/invariant/InChINumbersTools.java which configures InChI to do the atom order assignments, via the 'auxiliary information':

    public static long[] getNumbers(IAtomContainer atomContainer) throws CDKException {
        String aux = auxInfo(atomContainer, new InchiFlag[0]);
    ...
    static String auxInfo(IAtomContainer container, InchiFlag... flags) throws CDKException {
        InChIGeneratorFactory factory = InChIGeneratorFactory.getInstance();
        boolean org = factory.getIgnoreAromaticBonds();
        factory.setIgnoreAromaticBonds(true);
        InChIGenerator gen = factory.getInChIGenerator(container, flags);
        factory.setIgnoreAromaticBonds(org); // an option on the singleton so we should reset for others
        if (gen.getStatus() == InchiStatus.ERROR)
            throw new CDKException("Could not generate InChI Numbers: " + gen.getMessage());
        return gen.getAuxInfo();

That calls into the InChI library, which has this check (actually, it's in a few places, all with the same idea):

    max_num_at = ip->bLargeMolecules ? MAX_ATOMS : NORMALLY_ALLOWED_INP_MAX_ATOMS;
    if (nNumAtoms >= max_num_at)
    {
        TREAT_ERR( *err, 0, "Too many atoms [did you forget 'LargeMolecules' switch?]" );
        *err = 70;
        orig_inp_data->num_inp_atoms = -1;
        goto err_exit;
    }

where

    #define MAX_ATOMS                      32766
    #define NORMALLY_ALLOWED_INP_MAX_ATOMS 1024

The larger limit is enabled with the InChI flag 'LargeMolecules', which jna-inchi documents at

  https://github.com/dan2097/jna-inchi/blob/master/jna-inchi-api/src/main/java/io/github/dan2097/jnainchi/InchiFlag.java#L47

    /** Allows input of molecules up to 32767 atoms [Produces 'InChI=1B' indicating beta status of resulting identifiers]*/

so it appears that changing cdk/graph/invariant/InChINumbersTools.java line 49 from:

    String aux = auxInfo(atomContainer, new InchiFlag[0]);

to have LargeMolecules in that 'new InchiFlag' array would make this work.

However, I'm not a Java developer and don't know how to make this change, nor how to test it. I can say it does not seem to be user-configurable.

I am a Python developer, and I can reproduce the error using my 'chemfp translate' tool, which uses a Java/Python bridge to work with the CDK.

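Since Python is what I know, here's a toy model of the InChI size check quoted above. It is only a sketch: the two constants and the error text are copied from the InChI source, while the function name check_atom_count and its large_molecules parameter are names I made up for illustration, not part of InChI or CDK. It shows why the 1079-atom test molecule I use below trips the default limit, and why it would pass if LargeMolecules were set.

```python
# Constants copied from the InChI source excerpt quoted above.
MAX_ATOMS = 32766
NORMALLY_ALLOWED_INP_MAX_ATOMS = 1024

# Error text copied from the same InChI excerpt.
TOO_MANY_ATOMS = "Too many atoms [did you forget 'LargeMolecules' switch?]"


def check_atom_count(num_atoms, large_molecules=False):
    """Toy model of InChI's input-size check (hypothetical helper).

    Returns None if the atom count is accepted, else the error message.
    """
    max_num_at = MAX_ATOMS if large_molecules else NORMALLY_ALLOWED_INP_MAX_ATOMS
    # The C code uses '>=', so a count equal to the limit is rejected too.
    if num_atoms >= max_num_at:
        return TOO_MANY_ATOMS
    return None


if __name__ == "__main__":
    # 1079 is the atom count of the 77-tryptophan test molecule.
    print(check_atom_count(1079))
    print(check_atom_count(1079, large_molecules=True))
```

Note that because the comparison is '>=', the model rejects a molecule with exactly 1024 atoms under the default setting.
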
The following uses RDKit to translate a FASTA sequence to an SDF with 1079 atoms:

% python -c 'print(">megatryp\n" + "W"*77 + "\n")' | chemfp translate -T rdkit --in fasta --out sdf | head -6
megatryp
     RDKit

  0  0  0  0  0  0  0  0  0  0999 V3000
M  V30 BEGIN CTAB
M  V30 COUNTS 1079 1232 0 0 0

I can have it go from FASTA to SDF using RDKit, then have CDK read the SDF, to reproduce the SMILES generation failure:

% python -c 'print(">megatryp\n" + "W"*77 + "\n")' | chemfp translate -T rdkit --in fasta --via sdf -U cdk --out smi
Error: CDK cannot create the SMILES string (input title='megatryp'): An InChI could not be generated and used to canonise SMILES: null, file '<stdin>', line 1, record #1: first line is '>megatryp'. Skipping.

(The --via defaults to 'sdf' so I'll omit it in the rest.)

I can configure the CDK SMILES writer to use the Default flavor, but without the 'Canonical' option, to show that this work-around gives a (non-canonical) SMILES:

% python -c 'print(">megatryp\n" + "W"*77 + "\n")' | chemfp translate -T rdkit --in fasta -U cdk --out smi -W flavor=Default,-Canonical | fold | head -2
NC(C(=O)NC(C(=O)NC(C(=O)NC(C(=O)NC(C(=O)NC(C(=O)NC(C(=O)NC(C(=O)NC(C(=O)NC(C(=O)
NC(C(=O)NC(C(=O)NC(C(=O)NC(C(=O)NC(C(=O)NC(C(=O)NC(C(=O)NC(C(=O)NC(C(=O)NC(C(=O)

Here I'll disable Isomeric instead, so it should be canonical but not isomeric, which might be okay for you:

% python -c 'print(">megatryp\n" + "W"*77 + "\n")' | chemfp translate -T rdkit --in fasta -U cdk --out smi -W flavor=Default,-Isomeric | fold | head -2
O=C(O)C(NC(=O)C(NC(=O)C(NC(=O)C(NC(=O)C(NC(=O)C(NC(=O)C(NC(=O)C(NC(=O)C(NC(=O)C(
NC(=O)C(NC(=O)C(NC(=O)C(NC(=O)C(NC(=O)C(NC(=O)C(NC(=O)C(NC(=O)C(NC(=O)C(NC(=O)C(

That's the flavor you pass into SmilesGenerator().

Cheers,

                                Andrew
                                da...@dalkescientific.com