Author: sascha-guest Date: 2014-05-15 22:25:00 +0000 (Thu, 15 May 2014) New Revision: 16950
Added: trunk/packages/genometools/trunk/debian/patches/split-manuals Removed: trunk/packages/genometools/trunk/debian/patches/split_manuals Modified: trunk/packages/genometools/trunk/debian/changelog trunk/packages/genometools/trunk/debian/patches/series Log: Address LaTeX build problems for manuals Modified: trunk/packages/genometools/trunk/debian/changelog =================================================================== --- trunk/packages/genometools/trunk/debian/changelog 2014-05-15 14:30:23 UTC (rev 16949) +++ trunk/packages/genometools/trunk/debian/changelog 2014-05-15 22:25:00 UTC (rev 16950) @@ -1,3 +1,9 @@ +genometools (1.5.2-2) unstable; urgency=low + + * Split manuals into separate documents to avoid strange LaTeX build issues. + + -- Sascha Steinbiss <[email protected]> Thu, 15 May 2014 23:23:20 +0000 + genometools (1.5.2-1) unstable; urgency=low * New upstream release. Modified: trunk/packages/genometools/trunk/debian/patches/series =================================================================== --- trunk/packages/genometools/trunk/debian/patches/series 2014-05-15 14:30:23 UTC (rev 16949) +++ trunk/packages/genometools/trunk/debian/patches/series 2014-05-15 22:25:00 UTC (rev 16950) @@ -7,3 +7,4 @@ mips-64 no-xmllint spelling +split-manuals Added: trunk/packages/genometools/trunk/debian/patches/split-manuals =================================================================== --- trunk/packages/genometools/trunk/debian/patches/split-manuals (rev 0) +++ trunk/packages/genometools/trunk/debian/patches/split-manuals 2014-05-15 22:25:00 UTC (rev 16950) @@ -0,0 +1,410 @@ +Description: Split manuals into separate documents to avoid strange LaTeX build issues. 
+Author: Sascha Steinbiss <[email protected]> +--- a/doc/manuals/matstat.tex ++++ b/doc/manuals/matstat.tex +@@ -1,2 +1,160 @@ +-\def\BuildMatstat{} +-\input{uniquesub} ++\documentclass[12pt]{article} ++\usepackage[a4paper,top=20mm,bottom=20mm,left=20mm,right=20mm]{geometry} ++\usepackage{url} ++\usepackage{alltt} ++\usepackage{xspace} ++\usepackage{times} ++\usepackage{listings} ++ ++\usepackage{verbatim} ++\usepackage{ifthen} ++\usepackage{optionman} ++ ++\newcommand{\Substring}[3]{#1[#2..#3]} ++ ++\newcommand{\Program}[0]{\texttt{matstat}\xspace} ++\newcommand{\MS}[1]{\mathit{ms(s,#1)}} ++\title{\Program: a program for computing\\ ++ matching statistics\\ ++ a manual} ++ ++\author{\begin{tabular}{c} ++ \textit{Stefan Kurtz}\\ ++ Center for Bioinformatics,\\ ++ University of Hamburg ++ \end{tabular}} ++ ++\begin{document} ++\maketitle ++ ++\section{The program \Program} ++ ++The program \Program is called as follows: ++\par ++\noindent\Program [\textit{options}] \Showoption{query} \Showoptionarg{files} [\textit{options}] ++\par ++\Showoptionarg{files} is a white space separated list of at least one ++filename. Any sequence occurring in any file specified in \Showoptionarg{files} ++is called \textit{unit} in the following. ++In addition to the mandatory option \Showoption{query}, the program ++must be called with either option \Showoption{pck} or \Showoption{esa} ++which specify to use a packed index or an enhanced suffix array for ++a given set of subject sequences. ++ ++\Program computes the \textit{matching statistics} for each unit. That is, ++for each position \(i\) in ++each unit, say \(s\) of length \(n\), \(\MS{i}=(l,j)\) is computed. Here ++\(l\) is the largest integer such that \(\Substring{s}{i}{i+l-1}\) matches ++a substring represented by the index and \(j\) is a start position of the ++matching substring in the index. We say that \(l\) is the length of \(\MS{i}\) ++and \(j\) is the subject position of \(\MS{i}\). 
++ ++The following options are available in \Program: ++ ++\begin{Justshowoptions} ++\begin{comment} ++\Option{fmi}{$\Showoptionarg{indexname}$}{ ++Use the old implementation of the FMindex. This option is not recommended. ++} ++\end{comment} ++ ++\Option{esa}{$\Showoptionarg{indexname}$}{ ++Use the given enhanced suffix array to compute the matches. ++} ++ ++\Option{pck}{$\Showoptionarg{indexname}$}{ ++Use the packed index (an efficient representation of the FMindex) ++to compute the matches. ++} ++ ++ ++\Option{query}{$\Showoptionarg{files}$}{ ++Specify a white space separated list of query files containing the units. ++At least one query file must be given. The files may be in ++gzipped format, in which case they have to end with the suffix \texttt{.gz}. ++} ++ ++\Option{min}{$\ell$}{ ++Specify the minimum value $\ell$ for the length of the matching statistics. ++That is, for each unit \(s\) and each position \(i\) in \(s\), the program ++reports all values \(i\) and \(\MS{i}\) if the ++length of \(\MS{i}\) is at least \(\ell\). ++} ++ ++\Option{max}{$\ell$}{ ++Specify the maximum length $\ell$ for the length of the matching statistics. ++That is, for each unit \(s\) and each positions \(i\) in \(s\), the program ++reports the values \(i\) and \(\MS{i}\) if the length of \(\MS{i}\) ++is at most \(\ell\). ++} ++ ++\Option{output}{(\Showoptionkey{subjectpos}$\mid$\Showoptionkey{querypos}$\mid$\Showoptionkey{sequence})}{ ++Specify what to output. At least one of the three keys words ++$\Showoptionkey{subjectpos}$, ++$\Showoptionkey{querypos}$, and ++$\Showoptionkey{sequence}$ must be used. ++Using the keyword $\Showoptionkey{subjectpos}$ shows the ++subject position of the matching statistics. ++Using the keyword $\Showoptionkey{querypos}$ shows the query position. 
++Using the keyword $\Showoptionkey{sequence}$ shows the sequence content ++} ++ ++\Helpoption ++ ++\end{Justshowoptions} ++The following conditions must be satisfied: ++\begin{enumerate} ++\item ++Either option \Showoption{min} or option \Showoption{max} must be used. ++\item ++If both options \Showoption{min} and \Showoption{max} are used, then ++the value specified by option \(\Showoption{min}\) must be smaller ++than the value specified by option \(\Showoption{max}\). ++\item ++Either option \Showoption{pck} or \Showoption{esa} must be used. Both cannot ++be combined. ++\end{enumerate} ++ ++\section{Examples} ++ ++Suppose that in some directory, say \texttt{homo-sapiens}, we have 25 gzipped ++fasta files containing all 24 human chromomsomes plus one file with ++mitrochondrial sequences. These may have been downloaded from ++\url{ftp://ftp.ensembl.org/pub/current_fasta/homo_sapiens_47_36i/dna}. ++ ++In the first step, we construct the packed index for the entire genome: ++ ++\begin{Output} ++gt packedindex mkindex -dna -dir rev -parts 15 -bsize 10 -locfreq 32 ++ -indexname human-all -db homo-sapiens/*.gz ++\end{Output} ++ ++The program runs for almost two hours and delivers ++an index \texttt{human-all} consisting of three files: ++ ++\begin{Output} ++ls -lh human-all.* ++-rw-r----- 1 kurtz gistaff 37 2008-01-24 00:47 human-all.al1 ++-rw-r----- 1 kurtz gistaff 1.9G 2008-01-24 02:37 human-all.bdx ++-rw-r----- 1 kurtz gistaff 3.4K 2008-01-24 02:37 human-all.prj ++\end{Output} ++ ++This is used in the following call to the program \Program: ++ ++\begin{Output} ++gt matstat -output subjectpos querypos sequence -min 20 -max 30 ++ -query queryfile.fna -pck human-all ++unit 0 (Mus musculus, chr 1, complete sequence) ++22 20 390765125 actgtatctcaaaatataaa ++253 21 258488266 gggaataaacatgtcattgag ++254 20 258488267 ggaataaacatgtcattgag ++275 20 900483549 taattctatttttctttctt ++480 20 1008274536 gcttgaagatcatgatccag ++.. 
++\end{Output} ++Here, the first column shows the relative positions in unit 0 for which the ++length of the matching statistics is between 20 and 30. The second column is ++the corresponding length value. The third column shows position of the ++matching sequence in the index, and the fourth shows the sequence content. ++ ++\end{document} +--- a/doc/manuals/uniquesub.tex ++++ b/doc/manuals/uniquesub.tex +@@ -8,38 +8,15 @@ + + \usepackage{verbatim} + \usepackage{ifthen} +-\usepackage{comment} + \usepackage{optionman} + +-\ifthenelse{\isundefined{\BuildMatstat}}{% +-\includecomment{AboutUniquesub} +-\excludecomment{AboutMatstat} +-\newcommand{\AboutUniquesubcmd}[1]{#1} +-\newcommand{\AboutMatstatcmd}[1]{} +-}{% +-\includecomment{AboutMatstat} +-\excludecomment{AboutUniquesub} +-\newcommand{\AboutMatstatcmd}[1]{#1} +-\newcommand{\AboutUniquesubcmd}[1]{} +-} +- + \newcommand{\Substring}[3]{#1[#2..#3]} + +-\begin{AboutUniquesub} + \newcommand{\Program}[0]{\texttt{uniquesub}\xspace} + \newcommand{\Mup}[1]{\mathit{mup(s,#1)}} + \title{\Program: a program for computing\\ + minimum unique substrings\\ + a manual} +-\end{AboutUniquesub} +- +-\begin{AboutMatstat} +-\newcommand{\Program}[0]{\texttt{matstat}\xspace} +-\newcommand{\MS}[1]{\mathit{ms(s,#1)}} +-\title{\Program: a program for computing\\ +- matching statistics\\ +- a manual} +-\end{AboutMatstat} + + \author{\begin{tabular}{c} + \textit{Stefan Kurtz}\\ +@@ -54,47 +31,35 @@ + + The program \Program is called as follows: + \par +-\noindent\Program [\textit{options}] \Showoption{query} \Showoptionarg{files} [\textit{options}] ++\noindent\Program [\textit{options}] \Showoption{query} \Showoptionarg{files} [\textit{options}] + \par +-\Showoptionarg{files} is a white space separated list of at least one ++\Showoptionarg{files} is a white space separated list of at least one + filename. Any sequence occurring in any file specified in \Showoptionarg{files} + is called \textit{unit} in the following. 
+ In addition to the mandatory option \Showoption{query}, the program + must be called with either option \Showoption{pck} or \Showoption{esa} +-which specify to use a packed index or an enhanced suffix array for ++which specify to use a packed index or an enhanced suffix array for + a given set of subject sequences. + +-\begin{AboutUniquesub} + \Program computes for all positions \(i\) in each unit, say \(s\) of length +-\(n\), the length \(\Mup{i}\) of the minimum unique prefix ++\(n\), the length \(\Mup{i}\) of the minimum unique prefix + at position \(i\), if it exists. Uniqueness always refers to all substrings +-represented by the index. \(\Mup{i}\) is defined by the following two ++represented by the index. \(\Mup{i}\) is defined by the following two + statements: + \begin{itemize} + \item + If \(\Substring{s}{i}{n-1}\) is not unique in the index, then \(\Mup{i}=\bot\). + That is, it is undefined. + \item +-If \(\Substring{s}{i}{n-1}\) is unique in the index, then \(\Mup{i}=m\), where +-\(m\) is the smallest value such that \(i+m-1\leq n-1\) and ++If \(\Substring{s}{i}{n-1}\) is unique in the index, then \(\Mup{i}=m\), where ++\(m\) is the smallest value such that \(i+m-1\leq n-1\) and + \(\Substring{s}{i}{i+m-1}\) occurs exactly once as a substring in the index. + \end{itemize} +-Note that it is possible that for all \(i\in[0,n-1]\) we have +-\(\Mup{i}=\bot\), which means that unit \(s\) does not contain any unique ++Note that it is possible that for all \(i\in[0,n-1]\) we have ++\(\Mup{i}=\bot\), which means that unit \(s\) does not contain any unique + substring. In this case, the program reports nothing for the corresponding + unit. The program was developed for designing whole genome tiling arrays. + The corresponding publication is \cite{GRAE:NIE:KUR:HUY:BIR:STU:FLI:2007}. +-\end{AboutUniquesub} +- +-\begin{AboutMatstat} +-\Program computes the \textit{matching statistics} for each unit. 
That is, +-for each position \(i\) in +-each unit, say \(s\) of length \(n\), \(\MS{i}=(l,j)\) is computed. Here +-\(l\) is the largest integer such that \(\Substring{s}{i}{i+l-1}\) matches +-a substring represented by the index and \(j\) is a start position of the +-matching substring in the index. We say that \(l\) is the length of \(\MS{i}\) +-and \(j\) is the subject position of \(\MS{i}\). +-\end{AboutMatstat} + + The following options are available in \Program: + +@@ -117,58 +82,29 @@ + + \Option{query}{$\Showoptionarg{files}$}{ + Specify a white space separated list of query files containing the units. +-At least one query file must be given. The files may be in ++At least one query file must be given. The files may be in + gzipped format, in which case they have to end with the suffix \texttt{.gz}. + } + +-\begin{AboutUniquesub} + \Option{min}{$\ell$}{ + Specify the minimum length $\ell$ of the minimum unique prefixes. +-That is, for each unit \(s\) and each positions \(i\) in \(s\), the program +-reports the values \(i\) and \(\Mup{i}\) whenever \(\Mup{i}\geq\ell\). ++That is, for each unit \(s\) and each positions \(i\) in \(s\), the program ++reports the values \(i\) and \(\Mup{i}\) whenever \(\Mup{i}\geq\ell\). + } + + \Option{max}{$\ell$}{ + Specify the maximum length $\ell$ of the minimum unique prefixes. +-That is, for each unit \(s\) and each positions \(i\) in \(s\), the program ++That is, for each unit \(s\) and each positions \(i\) in \(s\), the program + reports the values \(i\) and \(\Mup{i}\) whenever \(\Mup{i}\leq\ell\). + } + + \Option{output}{(\Showoptionkey{querypos}$\mid$\Showoptionkey{sequence})}{ + Specify what to output. At least one of the two keys words +-$\Showoptionkey{querypos}$ and $\Showoptionkey{sequence}$ must be used. ++$\Showoptionkey{querypos}$ and $\Showoptionkey{sequence}$ must be used. + Using the keyword $\Showoptionkey{querypos}$ shows the query position. 
+ Using the keyword $\Showoptionkey{sequence}$ shows the sequence content + of the match. + } +-\end{AboutUniquesub} +- +-\begin{AboutMatstat} +-\Option{min}{$\ell$}{ +-Specify the minimum value $\ell$ for the length of the matching statistics. +-That is, for each unit \(s\) and each position \(i\) in \(s\), the program +-reports all values \(i\) and \(\MS{i}\) if the +-length of \(\MS{i}\) is at least \(\ell\). +-} +- +-\Option{max}{$\ell$}{ +-Specify the maximum length $\ell$ for the length of the matching statistics. +-That is, for each unit \(s\) and each positions \(i\) in \(s\), the program +-reports the values \(i\) and \(\MS{i}\) if the length of \(\MS{i}\) +-is at most \(\ell\). +-} +- +-\Option{output}{(\Showoptionkey{subjectpos}$\mid$\Showoptionkey{querypos}$\mid$\Showoptionkey{sequence})}{ +-Specify what to output. At least one of the three keys words +-$\Showoptionkey{subjectpos}$, +-$\Showoptionkey{querypos}$, and +-$\Showoptionkey{sequence}$ must be used. +-Using the keyword $\Showoptionkey{subjectpos}$ shows the +-subject position of the matching statistics. +-Using the keyword $\Showoptionkey{querypos}$ shows the query position. +-Using the keyword $\Showoptionkey{sequence}$ shows the sequence content +-} +-\end{AboutMatstat} + + \Helpoption + +@@ -189,7 +125,7 @@ + \section{Examples} + + Suppose that in some directory, say \texttt{homo-sapiens}, we have 25 gzipped +-fasta files containing all 24 human chromomsomes plus one file with ++fasta files containing all 24 human chromomsomes plus one file with + mitrochondrial sequences. These may have been downloaded from + \url{ftp://ftp.ensembl.org/pub/current_fasta/homo_sapiens_47_36i/dna}. 
+ +@@ -200,7 +136,7 @@ + -indexname human-all -db homo-sapiens/*.gz + \end{Output} + +-The program runs for almost two hours and delivers ++The program runs for almost two hours and delivers + an index \texttt{human-all} consisting of three files: + + \begin{Output} +@@ -212,9 +148,8 @@ + + This is used in the following call to the program \Program: + +-\begin{AboutUniquesub} + \begin{Output} +-gt uniquesub -output querypos -min 20 -max 30 -query queryfile.fna ++gt uniquesub -output querypos -min 20 -max 30 -query queryfile.fna + -pck human-all + unit 0 (Mus musculus, chr 1, complete sequence) + 1007 20 +@@ -227,7 +162,7 @@ + + For all units \(s\) in the multiple \Fasta file \texttt{queryfile.fna}, + a line is shown, reporting the number of the unit and the original fasta +-header. Also, all for positions \(i\) in \(s\) satisfying ++header. Also, all for positions \(i\) in \(s\) satisfying + \(20\leq \Mup{i}\leq 30\), \(i\) and \(\Mup{i}\) is reported. + + The first column is the relative position in the unit sequence (counting +@@ -238,7 +173,7 @@ + \Showoption{output}: + + \begin{Output} +-gt uniquesub -output querypos sequence -min 20 -max 30 ++gt uniquesub -output querypos sequence -min 20 -max 30 + -query queryfile.fna -pck human-all + unit 0 (Mus musculus, chr 1, complete sequence) + 1007 20 ctgacagtttttttttttta +@@ -248,27 +183,7 @@ + 1013 21 gttttttttttttactttata + ... + \end{Output} +-\end{AboutUniquesub} +- +-\begin{AboutMatstat} +-\begin{Output} +-gt matstat -output subjectpos querypos sequence -min 20 -max 30 +- -query queryfile.fna -pck human-all +-unit 0 (Mus musculus, chr 1, complete sequence) +-22 20 390765125 actgtatctcaaaatataaa +-253 21 258488266 gggaataaacatgtcattgag +-254 20 258488267 ggaataaacatgtcattgag +-275 20 900483549 taattctatttttctttctt +-480 20 1008274536 gcttgaagatcatgatccag +-.. +-\end{Output} +-Here, the first column shows the relative positions in unit 0 for which the +-length of the matching statistics is between 20 and 30. 
The second column is +-the corresponding length value. The third column shows position of the +-matching sequence in the index, and the fourth shows the sequence content. +-\end{AboutMatstat} + +-\begin{AboutUniquesub} + %\bibliographystyle{plain} + %\bibliography{defines,kurtz} + \begin{thebibliography}{1} +@@ -280,5 +195,4 @@ + \newblock {\em {Bioinformatics}}, {23 ISMB/ECCB 2007}:{i195--i204}, 2007. + + \end{thebibliography} +-\end{AboutUniquesub} + \end{document} Deleted: trunk/packages/genometools/trunk/debian/patches/split_manuals =================================================================== --- trunk/packages/genometools/trunk/debian/patches/split_manuals 2014-05-15 14:30:23 UTC (rev 16949) +++ trunk/packages/genometools/trunk/debian/patches/split_manuals 2014-05-15 22:25:00 UTC (rev 16950) @@ -1,409 +0,0 @@ -Description: split manuals into individual LaTeX files - On some platforms, LaTeX will not build manuals properly when - they share a single file in which text sections are marked or - unmarked as comments using includecomment/excludecomment. 
-Author: Sascha Steinbiss <[email protected]> ---- a/doc/manuals/matstat.tex -+++ b/doc/manuals/matstat.tex -@@ -1,2 +1,158 @@ --\def\BuildMatstat{} --\input{uniquesub} -+\documentclass[12pt]{article} -+\usepackage[a4paper,top=20mm,bottom=20mm,left=20mm,right=20mm]{geometry} -+\usepackage{url} -+\usepackage{alltt} -+\usepackage{xspace} -+\usepackage{times} -+\usepackage{listings} -+\usepackage{verbatim} -+\usepackage{optionman} -+ -+\newcommand{\Substring}[3]{#1[#2..#3]} -+ -+\newcommand{\Program}[0]{\texttt{matstat}\xspace} -+\newcommand{\MS}[1]{\mathit{ms(s,#1)}} -+\title{\Program: a program for computing\\ -+ matching statistics\\ -+ a manual} -+ -+\author{\begin{tabular}{c} -+ \textit{Stefan Kurtz}\\ -+ Center for Bioinformatics,\\ -+ University of Hamburg -+ \end{tabular}} -+ -+\begin{document} -+\maketitle -+ -+\section{The program \Program} -+ -+The program \Program is called as follows: -+\par -+\noindent\Program [\textit{options}] \Showoption{query} \Showoptionarg{files} [\textit{options}] -+\par -+\Showoptionarg{files} is a white space separated list of at least one -+filename. Any sequence occurring in any file specified in \Showoptionarg{files} -+is called \textit{unit} in the following. -+In addition to the mandatory option \Showoption{query}, the program -+must be called with either option \Showoption{pck} or \Showoption{esa} -+which specify to use a packed index or an enhanced suffix array for -+a given set of subject sequences. -+ -+\Program computes the \textit{matching statistics} for each unit. That is, -+for each position \(i\) in -+each unit, say \(s\) of length \(n\), \(\MS{i}=(l,j)\) is computed. Here -+\(l\) is the largest integer such that \(\Substring{s}{i}{i+l-1}\) matches -+a substring represented by the index and \(j\) is a start position of the -+matching substring in the index. We say that \(l\) is the length of \(\MS{i}\) -+and \(j\) is the subject position of \(\MS{i}\). 
-+ -+The following options are available in \Program: -+ -+\begin{Justshowoptions} -+\begin{comment} -+\Option{fmi}{$\Showoptionarg{indexname}$}{ -+Use the old implementation of the FMindex. This option is not recommended. -+} -+\end{comment} -+ -+\Option{esa}{$\Showoptionarg{indexname}$}{ -+Use the given enhanced suffix array to compute the matches. -+} -+ -+\Option{pck}{$\Showoptionarg{indexname}$}{ -+Use the packed index (an efficient representation of the FMindex) -+to compute the matches. -+} -+ -+ -+\Option{query}{$\Showoptionarg{files}$}{ -+Specify a white space separated list of query files containing the units. -+At least one query file must be given. The files may be in -+gzipped format, in which case they have to end with the suffix \texttt{.gz}. -+} -+ -+\Option{min}{$\ell$}{ -+Specify the minimum value $\ell$ for the length of the matching statistics. -+That is, for each unit \(s\) and each position \(i\) in \(s\), the program -+reports all values \(i\) and \(\MS{i}\) if the -+length of \(\MS{i}\) is at least \(\ell\). -+} -+ -+\Option{max}{$\ell$}{ -+Specify the maximum length $\ell$ for the length of the matching statistics. -+That is, for each unit \(s\) and each positions \(i\) in \(s\), the program -+reports the values \(i\) and \(\MS{i}\) if the length of \(\MS{i}\) -+is at most \(\ell\). -+} -+ -+\Option{output}{(\Showoptionkey{subjectpos}$\mid$\Showoptionkey{querypos}$\mid$\Showoptionkey{sequence})}{ -+Specify what to output. At least one of the three keys words -+$\Showoptionkey{subjectpos}$, -+$\Showoptionkey{querypos}$, and -+$\Showoptionkey{sequence}$ must be used. -+Using the keyword $\Showoptionkey{subjectpos}$ shows the -+subject position of the matching statistics. -+Using the keyword $\Showoptionkey{querypos}$ shows the query position. 
-+Using the keyword $\Showoptionkey{sequence}$ shows the sequence content -+} -+ -+\Helpoption -+ -+\end{Justshowoptions} -+The following conditions must be satisfied: -+\begin{enumerate} -+\item -+Either option \Showoption{min} or option \Showoption{max} must be used. -+\item -+If both options \Showoption{min} and \Showoption{max} are used, then -+the value specified by option \(\Showoption{min}\) must be smaller -+than the value specified by option \(\Showoption{max}\). -+\item -+Either option \Showoption{pck} or \Showoption{esa} must be used. Both cannot -+be combined. -+\end{enumerate} -+ -+\section{Examples} -+ -+Suppose that in some directory, say \texttt{homo-sapiens}, we have 25 gzipped -+fasta files containing all 24 human chromomsomes plus one file with -+mitrochondrial sequences. These may have been downloaded from -+\url{ftp://ftp.ensembl.org/pub/current_fasta/homo_sapiens_47_36i/dna}. -+ -+In the first step, we construct the packed index for the entire genome: -+ -+\begin{Output} -+gt packedindex mkindex -dna -dir rev -parts 15 -bsize 10 -locfreq 32 -+ -indexname human-all -db homo-sapiens/*.gz -+\end{Output} -+ -+The program runs for almost two hours and delivers -+an index \texttt{human-all} consisting of three files: -+ -+\begin{Output} -+ls -lh human-all.* -+-rw-r----- 1 kurtz gistaff 37 2008-01-24 00:47 human-all.al1 -+-rw-r----- 1 kurtz gistaff 1.9G 2008-01-24 02:37 human-all.bdx -+-rw-r----- 1 kurtz gistaff 3.4K 2008-01-24 02:37 human-all.prj -+\end{Output} -+ -+This is used in the following call to the program \Program: -+ -+\begin{Output} -+gt matstat -output subjectpos querypos sequence -min 20 -max 30 -+ -query queryfile.fna -pck human-all -+unit 0 (Mus musculus, chr 1, complete sequence) -+22 20 390765125 actgtatctcaaaatataaa -+253 21 258488266 gggaataaacatgtcattgag -+254 20 258488267 ggaataaacatgtcattgag -+275 20 900483549 taattctatttttctttctt -+480 20 1008274536 gcttgaagatcatgatccag -+.. 
-+\end{Output} -+Here, the first column shows the relative positions in unit 0 for which the -+length of the matching statistics is between 20 and 30. The second column is -+the corresponding length value. The third column shows position of the -+matching sequence in the index, and the fourth shows the sequence content. -+\end{document} ---- a/doc/manuals/uniquesub.tex -+++ b/doc/manuals/uniquesub.tex -@@ -7,37 +7,14 @@ --\usepackage{ifthen} --\usepackage{comment} - \usepackage{optionman} -- --\ifthenelse{\isundefined{\BuildMatstat}}{% --\includecomment{AboutUniquesub} --\excludecomment{AboutMatstat} --\newcommand{\AboutUniquesubcmd}[1]{#1} --\newcommand{\AboutMatstatcmd}[1]{} --}{% --\includecomment{AboutMatstat} --\excludecomment{AboutUniquesub} --\newcommand{\AboutMatstatcmd}[1]{#1} --\newcommand{\AboutUniquesubcmd}[1]{} --} -- - \newcommand{\Substring}[3]{#1[#2..#3]} - --\begin{AboutUniquesub} - \newcommand{\Program}[0]{\texttt{uniquesub}\xspace} - \newcommand{\Mup}[1]{\mathit{mup(s,#1)}} - \title{\Program: a program for computing\\ - minimum unique substrings\\ - a manual} --\end{AboutUniquesub} -- --\begin{AboutMatstat} --\newcommand{\Program}[0]{\texttt{matstat}\xspace} --\newcommand{\MS}[1]{\mathit{ms(s,#1)}} --\title{\Program: a program for computing\\ -- matching statistics\\ -- a manual} --\end{AboutMatstat} - - \author{\begin{tabular}{c} - \textit{Stefan Kurtz}\\ -@@ -54,47 +29,35 @@ - - The program \Program is called as follows: - \par --\noindent\Program [\textit{options}] \Showoption{query} \Showoptionarg{files} [\textit{options}] -+\noindent\Program [\textit{options}] \Showoption{query} \Showoptionarg{files} [\textit{options}] - \par --\Showoptionarg{files} is a white space separated list of at least one -+\Showoptionarg{files} is a white space separated list of at least one - filename. Any sequence occurring in any file specified in \Showoptionarg{files} - is called \textit{unit} in the following. 
- In addition to the mandatory option \Showoption{query}, the program - must be called with either option \Showoption{pck} or \Showoption{esa} --which specify to use a packed index or an enhanced suffix array for -+which specify to use a packed index or an enhanced suffix array for - a given set of subject sequences. - --\begin{AboutUniquesub} - \Program computes for all positions \(i\) in each unit, say \(s\) of length --\(n\), the length \(\Mup{i}\) of the minimum unique prefix -+\(n\), the length \(\Mup{i}\) of the minimum unique prefix - at position \(i\), if it exists. Uniqueness always refers to all substrings --represented by the index. \(\Mup{i}\) is defined by the following two -+represented by the index. \(\Mup{i}\) is defined by the following two - statements: - \begin{itemize} - \item - If \(\Substring{s}{i}{n-1}\) is not unique in the index, then \(\Mup{i}=\bot\). - That is, it is undefined. - \item --If \(\Substring{s}{i}{n-1}\) is unique in the index, then \(\Mup{i}=m\), where --\(m\) is the smallest value such that \(i+m-1\leq n-1\) and -+If \(\Substring{s}{i}{n-1}\) is unique in the index, then \(\Mup{i}=m\), where -+\(m\) is the smallest value such that \(i+m-1\leq n-1\) and - \(\Substring{s}{i}{i+m-1}\) occurs exactly once as a substring in the index. - \end{itemize} --Note that it is possible that for all \(i\in[0,n-1]\) we have --\(\Mup{i}=\bot\), which means that unit \(s\) does not contain any unique -+Note that it is possible that for all \(i\in[0,n-1]\) we have -+\(\Mup{i}=\bot\), which means that unit \(s\) does not contain any unique - substring. In this case, the program reports nothing for the corresponding - unit. The program was developed for designing whole genome tiling arrays. - The corresponding publication is \cite{GRAE:NIE:KUR:HUY:BIR:STU:FLI:2007}. --\end{AboutUniquesub} -- --\begin{AboutMatstat} --\Program computes the \textit{matching statistics} for each unit. 
That is, --for each position \(i\) in --each unit, say \(s\) of length \(n\), \(\MS{i}=(l,j)\) is computed. Here --\(l\) is the largest integer such that \(\Substring{s}{i}{i+l-1}\) matches --a substring represented by the index and \(j\) is a start position of the --matching substring in the index. We say that \(l\) is the length of \(\MS{i}\) --and \(j\) is the subject position of \(\MS{i}\). --\end{AboutMatstat} - - The following options are available in \Program: - -@@ -117,58 +80,29 @@ - - \Option{query}{$\Showoptionarg{files}$}{ - Specify a white space separated list of query files containing the units. --At least one query file must be given. The files may be in -+At least one query file must be given. The files may be in - gzipped format, in which case they have to end with the suffix \texttt{.gz}. - } - --\begin{AboutUniquesub} - \Option{min}{$\ell$}{ - Specify the minimum length $\ell$ of the minimum unique prefixes. --That is, for each unit \(s\) and each positions \(i\) in \(s\), the program --reports the values \(i\) and \(\Mup{i}\) whenever \(\Mup{i}\geq\ell\). -+That is, for each unit \(s\) and each positions \(i\) in \(s\), the program -+reports the values \(i\) and \(\Mup{i}\) whenever \(\Mup{i}\geq\ell\). - } - - \Option{max}{$\ell$}{ - Specify the maximum length $\ell$ of the minimum unique prefixes. --That is, for each unit \(s\) and each positions \(i\) in \(s\), the program -+That is, for each unit \(s\) and each positions \(i\) in \(s\), the program - reports the values \(i\) and \(\Mup{i}\) whenever \(\Mup{i}\leq\ell\). - } - - \Option{output}{(\Showoptionkey{querypos}$\mid$\Showoptionkey{sequence})}{ - Specify what to output. At least one of the two keys words --$\Showoptionkey{querypos}$ and $\Showoptionkey{sequence}$ must be used. -+$\Showoptionkey{querypos}$ and $\Showoptionkey{sequence}$ must be used. - Using the keyword $\Showoptionkey{querypos}$ shows the query position. 
- Using the keyword $\Showoptionkey{sequence}$ shows the sequence content - of the match. - } --\end{AboutUniquesub} -- --\begin{AboutMatstat} --\Option{min}{$\ell$}{ --Specify the minimum value $\ell$ for the length of the matching statistics. --That is, for each unit \(s\) and each position \(i\) in \(s\), the program --reports all values \(i\) and \(\MS{i}\) if the --length of \(\MS{i}\) is at least \(\ell\). --} -- --\Option{max}{$\ell$}{ --Specify the maximum length $\ell$ for the length of the matching statistics. --That is, for each unit \(s\) and each positions \(i\) in \(s\), the program --reports the values \(i\) and \(\MS{i}\) if the length of \(\MS{i}\) --is at most \(\ell\). --} -- --\Option{output}{(\Showoptionkey{subjectpos}$\mid$\Showoptionkey{querypos}$\mid$\Showoptionkey{sequence})}{ --Specify what to output. At least one of the three keys words --$\Showoptionkey{subjectpos}$, --$\Showoptionkey{querypos}$, and --$\Showoptionkey{sequence}$ must be used. --Using the keyword $\Showoptionkey{subjectpos}$ shows the --subject position of the matching statistics. --Using the keyword $\Showoptionkey{querypos}$ shows the query position. --Using the keyword $\Showoptionkey{sequence}$ shows the sequence content --} --\end{AboutMatstat} - - \Helpoption - -@@ -189,7 +123,7 @@ - \section{Examples} - - Suppose that in some directory, say \texttt{homo-sapiens}, we have 25 gzipped --fasta files containing all 24 human chromomsomes plus one file with -+fasta files containing all 24 human chromomsomes plus one file with - mitrochondrial sequences. These may have been downloaded from - \url{ftp://ftp.ensembl.org/pub/current_fasta/homo_sapiens_47_36i/dna}. 
- -@@ -200,7 +134,7 @@ - -indexname human-all -db homo-sapiens/*.gz - \end{Output} - --The program runs for almost two hours and delivers -+The program runs for almost two hours and delivers - an index \texttt{human-all} consisting of three files: - - \begin{Output} -@@ -212,9 +146,8 @@ - - This is used in the following call to the program \Program: - --\begin{AboutUniquesub} - \begin{Output} --gt uniquesub -output querypos -min 20 -max 30 -query queryfile.fna -+gt uniquesub -output querypos -min 20 -max 30 -query queryfile.fna - -pck human-all - unit 0 (Mus musculus, chr 1, complete sequence) - 1007 20 -@@ -227,7 +160,7 @@ - - For all units \(s\) in the multiple \Fasta file \texttt{queryfile.fna}, - a line is shown, reporting the number of the unit and the original fasta --header. Also, all for positions \(i\) in \(s\) satisfying -+header. Also, all for positions \(i\) in \(s\) satisfying - \(20\leq \Mup{i}\leq 30\), \(i\) and \(\Mup{i}\) is reported. - - The first column is the relative position in the unit sequence (counting -@@ -238,7 +171,7 @@ - \Showoption{output}: - - \begin{Output} --gt uniquesub -output querypos sequence -min 20 -max 30 -+gt uniquesub -output querypos sequence -min 20 -max 30 - -query queryfile.fna -pck human-all - unit 0 (Mus musculus, chr 1, complete sequence) - 1007 20 ctgacagtttttttttttta -@@ -248,27 +181,7 @@ - 1013 21 gttttttttttttactttata - ... - \end{Output} --\end{AboutUniquesub} - --\begin{AboutMatstat} --\begin{Output} --gt matstat -output subjectpos querypos sequence -min 20 -max 30 -- -query queryfile.fna -pck human-all --unit 0 (Mus musculus, chr 1, complete sequence) --22 20 390765125 actgtatctcaaaatataaa --253 21 258488266 gggaataaacatgtcattgag --254 20 258488267 ggaataaacatgtcattgag --275 20 900483549 taattctatttttctttctt --480 20 1008274536 gcttgaagatcatgatccag --.. --\end{Output} --Here, the first column shows the relative positions in unit 0 for which the --length of the matching statistics is between 20 and 30. 
The second column is --the corresponding length value. The third column shows position of the --matching sequence in the index, and the fourth shows the sequence content. --\end{AboutMatstat} -- --\begin{AboutUniquesub} - %\bibliographystyle{plain} - %\bibliography{defines,kurtz} - \begin{thebibliography}{1} -@@ -280,5 +193,5 @@ - \newblock {\em {Bioinformatics}}, {23 ISMB/ECCB 2007}:{i195--i204}, 2007. - - \end{thebibliography} --\end{AboutUniquesub} -+ - \end{document}
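Note on the matstat manual text carried by this patch: it defines the matching statistics ms(s,i)=(l,j) as the length l of the longest prefix of s[i..n-1] that matches a substring of the index, together with a start position j of that match. The following is only a minimal sketch of that definition, assuming a plain Python string as the subject instead of a packed index or enhanced suffix array. The function names `matching_statistics` and `report` are hypothetical and are not part of GenomeTools; the quadratic scan is for illustration and is not how `gt matstat` works.

```python
def matching_statistics(query, subject):
    """Return a list of (l, j) pairs, one per position i of query.

    l is the largest length such that query[i:i+l] occurs in subject,
    j is one start position of that occurrence (None if l == 0).
    Naive scan for illustration only.
    """
    n = len(query)
    result = []
    for i in range(n):
        best_l, best_j = 0, None
        for start in range(len(subject)):
            l = 0
            # Extend the match while both strings have characters left
            # and the characters agree.
            while (i + l < n and start + l < len(subject)
                   and query[i + l] == subject[start + l]):
                l += 1
            if l > best_l:
                best_l, best_j = l, start
        result.append((best_l, best_j))
    return result


def report(query, subject, min_len, max_len):
    """Keep positions whose match length lies in [min_len, max_len].

    Each row is (query position, length, subject position, sequence),
    mirroring the four columns described in the manual.
    """
    rows = []
    for i, (l, j) in enumerate(matching_statistics(query, subject)):
        if min_len <= l <= max_len:
            rows.append((i, l, j, query[i:i + l]))
    return rows


if __name__ == "__main__":
    for row in report("acgt", "ccgta", 2, 3):
        print(*row)
```

The row layout follows the column description in the manual (query position, length, subject position, sequence content); the length bounds play the role of the -min and -max options.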
