Re: [Bioc-devel] A handful of check to follow up on R CMD BiocCheck

Martin Morgan Thu, 10 Nov 2016 03:24:37 -0800

On 11/10/2016 05:36 AM, Kevin RUE wrote:

Hi Martin,


Thanks for spending the time and effort to refactor and integrate my
code into BiocCheck. I didn't intend to give you so much to think about!

For the record, I used the "checkFunctionLengths" method (now starting
at line 829 of checks.R) as a template to offer code somewhat homogeneous
with what I could see in place already. I believe that function should
also be refactored using your new "Context" and "handleContext" helpers,
right?

yep


Side note:
Far from trying to compete with the BiocCheck Bioconductor package, I
couldn't help but update my "follow-up" "BiocCheckTools" Python code

I think the right place for this detailed stuff is actually in lintr and/or formatR.

<https://github.com/kevinrue/BiocCheckTools/blob/master/README.md> in
the meantime (sorry for unoriginal name, I'll take suggestions!) taking
Kasper's advice on board:

  * Single master file which runs everything and gives a report.
      o *Bonus*: each module can be run separately (line_chars; tab_width).
  * All line numbers are reported
  * Content of the line is not displayed
  * I didn't mark up TABs, but I indicate how many spaces they are made
    of (when not a multiple of 4)

All the best,
Kevin


On Wed, Nov 9, 2016 at 11:31 PM, Martin Morgan
<martin.mor...@roswellpark.org> wrote:

    On 11/03/2016 08:14 AM, Kevin RUE wrote:

        Apologies for the additional spam, for two reasons:

          * The diff files that I've previously sent had the base and
            modified versions swapped. This new one fixes that.
          * This new diff file (always relative to the code I cloned from
            Bioconductor-mirror) also fixes a bug whereby the updated
            code would fail on packages that do not break any of the
            three guidelines.


    Hi Kevin -- thanks for these, I incorporated a version in 1.11.1.

    The report is for the first 6 lines of offenders; the code for
    finding all offenders is mentioned in the vignette

      BiocCheck:::checkFormatting("path/to/YourPackage", nlines=Inf)

    Marcel Ramos mentions the lintr (CRAN) and goodpractice (github
    only, https://github.com/MangoTheCat/goodpractice
    <https://github.com/MangoTheCat/goodpractice>) packages as useful
    resources for these types of questions; also formatR (CRAN).

    For the record, it was fun to work through your code. The basic
    pattern was to read lines and flag those that violated a (usually
    simple) test, add these to a data frame, then if after parsing all
    files the data frame had any rows report these to the user.
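A minimal sketch of that read-flag-accumulate-report pattern, for the line-length check only. The function names (flagLongLines, reportFlagged) and the column names are hypothetical and are not taken from BiocCheck:

```r
# Hypothetical helper (not BiocCheck code): flag lines longer than `width`.
flagLongLines <- function(files, width = 80L) {
    flagged <- data.frame(
        file = character(), line = integer(), stringsAsFactors = FALSE
    )
    for (fl in files) {
        lines <- readLines(fl, warn = FALSE)
        idx <- which(nchar(lines) > width)      # the (simple) test
        if (length(idx))
            flagged <- rbind(
                flagged,
                data.frame(file = basename(fl), line = idx,
                           stringsAsFactors = FALSE)
            )
    }
    flagged
}

# Report only when rows were flagged, and only the first `n` of them.
reportFlagged <- function(flagged, n = 6L) {
    if (nrow(flagged) == 0L)
        return(invisible(flagged))
    message(nrow(flagged), " line(s) over the limit; showing first ",
            min(n, nrow(flagged)))
    print(head(flagged, n))
    invisible(flagged)
}
```

Usage would be along the lines of reportFlagged(flagLongLines(list.files("R", full.names = TRUE))), run from a package directory.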

    I standardized the columns of the data frame across checks to
    facilitate re-use, and imposed the standard by using a constructor
    -- non-exported Context(). I took a little bit of pain to make the
    constructor work (return a data frame with the correct columns but
    no rows) when there are no lines of code flagged to be added to the
    context, so that I could use it without having to insert a bunch of
    conditional `if (...)` statements in my code that make it both harder to
    understand and harder to test. Also, the constructor is vectorized,
    whereas your implementation was iterative.
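A sketch of what such a constructor could look like. This is an illustration under assumptions, not the BiocCheck source: the name makeContext, its arguments and the column names are all hypothetical. The point is that a logical index drives every column, so "nothing flagged" falls out as a zero-row data frame with the same columns:

```r
# Hypothetical constructor; argument and column names are assumptions.
makeContext <- function(file = "", lines = character(), idx = logical()) {
    data.frame(
        File = rep(file, sum(idx)),   # one row per flagged line
        Line = which(idx),            # line numbers of flagged lines
        Context = lines[idx],         # content of flagged lines
        stringsAsFactors = FALSE
    )
}

# No arguments: correct columns, zero rows, so callers need no special case.
empty <- makeContext()

# Vectorized use: the test is applied to all lines of a file at once.
lines <- c("short", strrep("x", 90))
ctx <- makeContext("R/a.R", lines, nchar(lines) > 80)
combined <- rbind(empty, ctx)
```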

    For reporting to the user, I wrote a small helper handleContext()
    that I could re-use across the different types of checks (and likely
    elsewhere in the package) and also maintain the separation of what
    is reported to the user (e.g., in handleContext()) from how it is
    reported to the user -- .msg, .verbatim, etc in the utils.R file.
    Probably the reporting could have been refactored further.
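The separation of "what" from "how" might be sketched as follows. All three functions are hypothetical stand-ins (emitMessage and emitVerbatim play the role that .msg and .verbatim are described as playing; their real signatures are not shown in this thread), and the File and Line columns are assumed:

```r
# Hypothetical emitters ("how"): change these to change the presentation.
emitMessage <- function(fmt, ...) message(sprintf(fmt, ...))
emitVerbatim <- function(x) message(paste0("    ", x, collapse = "\n"))

# Hypothetical handler ("what"): chooses which rows and fields to report.
reportContext <- function(ctxt, nlines = 6) {
    if (nrow(ctxt) == 0L)
        return(invisible(character()))
    shown <- head(ctxt, nlines)
    txt <- sprintf("%s:%d", shown$File, shown$Line)
    emitMessage("First %d of %d flagged lines:", nrow(shown), nrow(ctxt))
    emitVerbatim(txt)
    invisible(txt)
}
```

Because every check produces the same columns, one handler of this kind can serve all of them.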

    I feel guilty about not writing unit tests; the package does include
    tests, including a handy package-generator function to make packages
    that are invalid in various ways. It seems like a better way to go
    would be to further refactor the body of the checkFormatting()
    function so that one could for instance invoke a function
    checkLineLengths(pkg, file, lines) and get back a data.frame created
    by Context(); it would then be easy to test these for various
    'lines' without having to figure out how to make test packages. But
    this seemed of course like too much additional work for this feature
    (I'll probably regret not expending the effort at a future date).
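A sketch of that refactoring, to show why it would be easy to test. The checkLineLengths(pkg, file, lines) signature follows the suggestion above; the body, the `width` argument and the stand-in Context() defined here are assumptions for illustration only:

```r
# Stand-in for the non-exported Context() constructor (columns assumed).
Context <- function(file = "", lines = character(), idx = logical()) {
    data.frame(File = rep(file, sum(idx)), Line = which(idx),
               Context = lines[idx], stringsAsFactors = FALSE)
}

# Signature as suggested in the text; the body is a hypothetical sketch.
checkLineLengths <- function(pkg, file, lines, width = 80L) {
    Context(file = file.path(pkg, file), lines = lines,
            idx = nchar(lines) > width)
}

# Tests only need character vectors, not generated test packages.
ok <- checkLineLengths("pkg", "R/a.R", c("x <- 1", "y <- 2"))
bad <- checkLineLengths("pkg", "R/a.R", c("x <- 1", strrep("z", 81)))
stopifnot(nrow(ok) == 0L, nrow(bad) == 1L, bad$Line == 2L)
```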

    I also found myself using the horrible 'copy-and-append' paradigm

       ctxt = Context()
       for (fl in files) {
           ...
           ctxt = rbind(ctxt, Context(...))
       }

    which makes a copy of (the increasingly large) ctxt each time
    through the loop and therefore scales quadratically (when each file
    has some problems) with the number of files. I'm counting on the
    number of files to be small (e.g., less than 10000, when the
    copy-and-append pattern starts to be expensive even for trivial
    operations; mzR has ~8750 files, though many of these are not
    checked for formatting) so that this pattern does not become a
    bottleneck. It was still painful to use...
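For comparison, a common alternative is to collect the per-file pieces in a list and bind them once at the end. The sketch below uses a hypothetical checkOne() in place of the real per-file check, and made-up file names:

```r
# Hypothetical per-file check returning a small data frame.
checkOne <- function(fl) {
    data.frame(File = fl, Line = 1L, stringsAsFactors = FALSE)
}

files <- sprintf("file%03d.R", 1:200)

# Copy-and-append: each rbind() copies everything accumulated so far.
ctxt <- data.frame(File = character(), Line = integer(),
                   stringsAsFactors = FALSE)
for (fl in files)
    ctxt <- rbind(ctxt, checkOne(fl))

# Collect first, bind once: a single rbind() over the list of pieces.
pieces <- lapply(files, checkOne)
ctxt2 <- do.call(rbind, pieces)

stopifnot(nrow(ctxt) == nrow(ctxt2))
```

Both versions give the same rows; the second avoids the repeated copying of the growing data frame.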

    Martin


        Best,
        Kevin


        On Thu, Nov 3, 2016 at 11:49 AM, Kevin RUE <kevinru...@gmail.com>
        wrote:

            Hi all,

            Please find attached the diff relative to the code that I cloned
            from Bioconductor-mirror yesterday (please ignore the previous
            diff file).

            Basically three new features:

              * As per previous email: display up to the first 6 lines
                that are over 80 characters long
              * *New*: display up to the first 6 lines that are not
                indented by a multiple of 4 spaces
              * *New*: display up to the first 6 lines that use TAB
                instead of 4 spaces for indentation
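The two indentation checks in the list above could be sketched with regular expressions as below. The helper names are hypothetical and this is not the code from the attached diff; note that a rule this simple would also flag continuation lines aligned to an opening parenthesis:

```r
# Hypothetical helpers operating on a character vector of source lines.

# TRUE where the leading whitespace contains a TAB character.
usesTabIndent <- function(lines) grepl("^ *\t", lines)

# TRUE where a non-blank line starts with a number of spaces that is
# not a multiple of `width`.
badIndent <- function(lines, width = 4L) {
    indent <- attr(regexpr("^ *", lines), "match.length")
    grepl("\\S", lines) & indent %% width != 0L
}

lines <- c("f <- function() {", "    x <- 1", "  y <- 2", "\tz <- 3", "", "}")
tabLines <- which(usesTabIndent(lines))
oddLines <- which(badIndent(lines))
```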

            I also attach the output of the updated code
            <https://github.com/kevinrue/BiocCheck>, to illustrate the changes.

            Notes:

              * For demonstration purposes, I indented a handful of lines
                from the checks.R file itself with TAB characters. I assume
                that's OK, as some lines were already longer than 80
                characters and not indented by a multiple of 4 spaces.

            All the best,
            Kevin


            On Wed, Nov 2, 2016 at 10:00 PM, Kevin RUE
            <kevinru...@gmail.com> wrote:

                Me again :)

                Please find attached the first patch to print the first 6
                lines over 80 characters long. (I'll get to the tabulation
                offenders next).

                Note that all the offending lines are stored in the
                "df.length" data.frame. How about an option like
                "fullReport=c(FALSE, TRUE)" that prints *all* the offending
                lines?
                The data.frame also stores the content of the lines for the
                record, but does not print them. I think Kasper is right:
                filename and line should be enough to track down the line.

                All the best,
                Kevin



                On Wed, Nov 2, 2016 at 8:08 PM, Kevin RUE
                <kevinru...@gmail.com> wrote:

                    Thanks for the feedback!

                    I also tend to prefer *all* the lines being reported (or
                    to be honest, that was really true when I had lots of
                    them; a problem that I largely mitigated by fixing all of
                    them once and subsequently paying more attention while
                    developing).

                    Printing the content of the offending line somewhat helps
                    me spot the line faster (more so for tab issues). But I
                    must admit that showing the whole line is somewhat
                    "overkill". I just started thinking of a compromise being
                    to only show the first N characters of the line, with N
                    being 80 minus the number of characters necessary to
                    print the filename and line number.
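The arithmetic of that compromise could be sketched as below. The function name and the "file:line " prefix layout are assumptions chosen for illustration:

```r
# Hypothetical formatter: keep each report line within `width` characters.
formatOffender <- function(file, line, content, width = 80L) {
    prefix <- sprintf("%s:%d ", file, line)
    room <- pmax(width - nchar(prefix), 0L)   # N = 80 minus prefix length
    paste0(prefix, substr(content, 1L, room))
}

out <- formatOffender("R/checks.R", 829L, strrep("x", 200))
stopifnot(nchar(out) == 80L)
```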

                    Thanks Martin for pointing out the lines in BiocCheck.
                    (Now I feel bad for not having checked sooner.. hehe!)
                    I think the idea of BiocCheck showing the first 6
                    offenders is quite nice, as I rarely have more since I
                    use the RStudio "Tools > Global Options > Code > Display >
                    Show Margin > Margin column: 80" feature.

                    I'll give a go at both approaches (developing BiocCheck
                    and my own scripts)

                    Cheers,
                    Kevin


                    On Wed, Nov 2, 2016 at 7:41 PM, Kasper Daniel Hansen
                    <kasperdanielhan...@gmail.com> wrote:

                        I would prefer all line numbers reported, but on the
                        other hand I am indifferent wrt. the content of the
                        line, unless (say) TABs are marked up somehow.

                        Kasper

                        On Wed, Nov 2, 2016 at 3:17 PM, Martin Morgan
                        <martin.mor...@roswellpark.org> wrote:

                            On 11/02/2016 02:49 PM, Kevin RUE wrote:

                                Dear all,

                                Just thought I'd share a handful of scripts
                                that I wrote to follow up on certain NOTE
                                messages thrown by R CMD BiocCheck.

                                https://github.com/kevinrue/BiocCheckTools

                                They're very simple, but I occasionally find
                                them quite convenient.
                                Apologies if something similar already exists
                                somewhere :)


                            Maybe consider creating a diff against the source
                            code that, e.g., reported the first 6 offenders?
                            The relevant lines are near

                            https://github.com/Bioconductor-mirror/BiocCheck/blob/master/R/checks.R#L1081

                            Martin


                                All the best,
                                Kevin

                                        [[alternative HTML version deleted]]


                                _______________________________________________
                                Bioc-devel@r-project.org mailing list
                                https://stat.ethz.ch/mailman/listinfo/bioc-devel



                            This email message may contain legally privileged
                            and/or...{{dropped:2}}


    This email message may contain legally privileged and/or
    confidential information.  If you are not the intended recipient(s),
    or the employee or agent responsible for the delivery of this
    message to the intended recipient(s), you are hereby notified that
    any disclosure, copying, distribution, or use of this email message
    is prohibited.  If you have received this message in error, please
    notify the sender immediately by e-mail and delete this email
    message from your computer. Thank you.




_______________________________________________
Bioc-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/bioc-devel
