I'd prefer Make / MakeUnsafe as well.
Regards Antoine. Le 11/03/2019 à 15:15, Francois Saint-Jacques a écrit : > I can settle with Micah's proposal of `MakeValidate`, though I'd prefer > that Make is safe by default and responsible users use MakeUnsafe :), I'll > settle with the pragmatic choice of backward compatibility. > > François > > > On Sun, Mar 10, 2019 at 5:45 PM Wes McKinney <wesmck...@gmail.com> wrote: > >> I think having consistent methods for both validated and unvalidated >> construction is a good idea. Being fairly passionate about >> microperformance, I don't think we should penalize responsible users >> of unsafe/unvalidated APIs (e.g. by taking them away and replacing >> them with variants featuring unavoidable computation), this can >> partially be handled through developer documentation (which, ah, we >> will need to write more of) >> >> On Sun, Mar 10, 2019 at 4:01 PM Micah Kornfield <emkornfi...@gmail.com> >> wrote: >>> >>> I agree there should always be a path to avoid the validation but I >> think there should also be an easy way to have validation included and a >> clear way to tell the difference. IMO, having strong naming convention so >> callers can tell the difference, and code reviewers can focus more on less >> safe method usage, is important. I will help new-comers to the project >> write safer code. Which can either be refactored or called out in code >> review for performance issues. It also provides a cue for all developers >> to consider if they are meeting the necessary requirements when using less >> safe methods. >>> >>> A straw-man proposal for naming conventions: >>> - Constructors are always unvalidated (should still have appropriate >> DCHECKs) >>> - "Make" calls are always unvalidated (should still have appropriate >> DHCECKs) >>> - "MakeValidated" ensures proper structural validation occur (but not >> data is validation). >>> - "MakeSanitized" ensures proper structural and data is validations occur >>> >>> As noted above it might only pay to refactor a small amount of current >> usage to the safer APIs. >>> >>> We could potentially go even further down the rabbit hole and try to >> define standard for a Hungarian notation [1] to make it more obvious what >> invariants are expected for a particular data-structure variable (I'm >> actually -.5 on this). >>> >>> As a personal bias, I would rather have slower code that has lower risk >> of crashing in production than faster code that does. Obviously, there is >> a tradeoff here, and the ideal is faster code that won't segfault. >>> >>> Thoughts? >>> >>> -Micah >>> >>> [1] >> https://www.joelonsoftware.com/2005/05/11/making-wrong-code-look-wrong/ >>> >>> On Sun, Mar 10, 2019 at 9:38 AM Wes McKinney <wesmck...@gmail.com> >> wrote: >>>> >>>> hi folks, >>>> >>>> I think some issues are being conflated here, so let me try to dig >>>> through them. Let's first look at the two cited bugs that were fixed, >>>> if I have this right: >>>> >>>> * ARROW-4766: root cause dereferencing a null pointer >>>> * ARROW-4774: root cause unsanitized Python user input >>>> >>>> None of the 4 remedies listed could have prevented ARROW-4766 AFAICT >>>> since we currently allow for null buffers (the object, not the pointer >>>> inside) in ArrayData. This has been discussed on the mailing list in >>>> the past; "sanitizing" ArrayData to be free of null objects would be >>>> expensive and my general attitude in the C++ library is that we should >>>> not be in the business of paying extra CPU cycles for the 1-5% case >>>> when it is unneeded in the 95-99% of cases. We have DCHECK assertions >>>> to check these issues in debug builds while avoiding the costs in >>>> release builds. In the case of checking edge cases in computational >>>> kernels, suffice to say that we should check every kernel on length-0 >>>> input with null buffers to make sure this case is properly handled >>>> >>>> In the case of ARROW-4774, we should work at the language binding >>>> interface to make sure we have convenient "validating" constructors >>>> that check user input for common problems. This can prevent the >>>> duplication of this code across the binding layers (GLib, Python, R, >>>> MATLAB, etc.) >>>> >>>> On the specific 4 steps mentioned by Francois, here are my thoughts: >>>> >>>> 1. Having StatusOr would be useful, but this is a programming >> convenience >>>> >>>> 2. There are a couple purposes of the static factory methods that >>>> exist now, like Table::Make and RecordBatch::Make. One of the reasons >>>> that I added them initially was because of the implicit constructor >>>> behavior of std::vector inside a call to std::make_shared. If you have >>>> a std::vector<T> argument in a class's constructor, then >>>> std::make_shared<Klass>(..., {...}) will not result in the initializer >>>> list constructing the std::vector<T>. This means some awkwardness like >>>> having to assign a std::vector<T> lvalue and _then_ pass that to >>>> std::make_shared<Klass>(..., vector_arg, ...). >>>> >>>> I do not agree with refactoring these methods to use "validating" >>>> constructors. Users of these C++ APIs should know what their >>>> requirements are, and we provide in some cases a Validate() to spend >>>> the extra cycles to assert preconditions. Therefore: >>>> >>>> 3. -1 on this >>>> 4. -1 also >>>> >>>> Thanks >>>> Wes >>>> >>>> On Sat, Mar 9, 2019 at 9:58 PM Micah Kornfield <emkornfi...@gmail.com> >> wrote: >>>>> >>>>> HI François, >>>>> This sounds great. I would hope that as part of this we document the >>>>> invariants (and possible sharp edges like zero length/all null and no >> null >>>>> Arrays). >>>>> >>>>> Is your intent to allow languages binding to the C++ library go >> through the >>>>> new API or will they still be able to use the "dangerous" ones? >>>>> >>>>> -Micah >>>>> >>>>> On Fri, Mar 8, 2019 at 6:16 PM Francois Saint-Jacques < >>>>> fsaintjacq...@gmail.com> wrote: >>>>> >>>>>> Greetings, >>>>>> >>>>>> I noted that the current C++ API permits constructing core objects >> breaking >>>>>> said classes invariants. The following recent issues were affected >> by this: >>>>>> >>>>>> - ARROW-4766: segfault due to invalid ArrayData with nullptr buffer >>>>>> - ARROW-4774: segfault due to invalid Table with columns of >> different >>>>>> length >>>>>> >>>>>> Consumers often assumes that the objects respect the invariants, >> e.g. by >>>>>> dereferencing `array_data->buffers[i]->data()` or >>>>>> `Array::GetValue(table.n_rows - 1)` . >>>>>> >>>>>> Sample of classes which requires invariant in the constructor: >>>>>> - ArrayData/Array: number and size of buffers depending on type >>>>>> - ChunkedArray: Arrays of same type >>>>>> - Column: same as ChunkedArray and Field must match array's type >>>>>> - RecordBatch: number of columns and schema must match, columns of >> same >>>>>> size >>>>>> - Table: columns must be of same size >>>>>> >>>>>> Some classes provide static factory methods, notably: >>>>>> - Most `shared_ptr<T> *::Make` methods, but they lack the Status >> return >>>>>> type to indicate >>>>>> failure, the consumer can at least check for nullptr >>>>>> - `Status Table::FromRecordBatches(..., shared_ptr<T> *out)` is a >> good >>>>>> candidate to follow >>>>>> >>>>>> I suspect that mis-usage is only going to grow with the number of >> users and >>>>>> language that binds to C++. I would like to propose a plan to >> tackle for >>>>>> the >>>>>> 0.14.0 release >>>>>> >>>>>> 1. Implement `StatusOr` (ARROW-4800), providing a cleaner API by >> minimizing >>>>>> output parameters. >>>>>> 2. Refactor normalized factory methods for each core object >> (ArrayData, >>>>>> ChunkedArray, Column, RecordBatch, Table) >>>>>> - Common naming, I suggest we stay with `Make`. >>>>>> - Common return type, `StatusOr<shared_ptr<T>>` >>>>>> 3. Refactor existing Make methods to use new methods but preserve >> original >>>>>> signature by losing error message, on top of marking them >> deprecated. >>>>>> 4. Mark non-validating constructors as deprecated and ideailly make >> every >>>>>> "dangerous" constructor non-public. >>>>>> >>>>>> We'd give 1-2 release for consumers to stop using the deprecated >>>>>> methods/constructors. >>>>>> >>>>>> François >>>>>> >> >