I'd prefer Make / MakeUnsafe as well.

Regards

Antoine.


Le 11/03/2019 à 15:15, Francois Saint-Jacques a écrit :
> I can settle with Micah's proposal of `MakeValidate`, though I'd prefer
> that Make is safe by default and responsible users use MakeUnsafe :), I'll
> settle with the pragmatic choice of backward compatibility.
> 
> François
> 
> 
> On Sun, Mar 10, 2019 at 5:45 PM Wes McKinney <wesmck...@gmail.com> wrote:
> 
>> I think having consistent methods for both validated and unvalidated
>> construction is a good idea. Being fairly passionate about
>> microperformance, I don't think we should penalize responsible users
>> of unsafe/unvalidated APIs (e.g. by taking them away and replacing
>> them with variants featuring unavoidable computation), this can
>> partially be handled through developer documentation (which, ah, we
>> will need to write more of)
>>
>> On Sun, Mar 10, 2019 at 4:01 PM Micah Kornfield <emkornfi...@gmail.com>
>> wrote:
>>>
>>> I agree there should always be a path to avoid the validation but I
>> think there should also be an easy way to have validation included and a
>> clear way to tell the difference.  IMO, having strong naming convention so
>> callers can tell the difference, and code reviewers can focus more on less
>> safe method usage, is important.  I will help new-comers to the project
>> write safer code.  Which can either be refactored or called out in code
>> review for performance issues.  It also provides a cue for all developers
>> to consider if they are meeting the necessary requirements when using less
>> safe methods.
>>>
>>> A straw-man proposal for naming conventions:
>>> - Constructors are always unvalidated (should still have appropriate
>> DCHECKs)
>>> - "Make" calls are always unvalidated (should still have appropriate
>> DHCECKs)
>>> - "MakeValidated" ensures proper structural validation occur (but not
>> data is validation).
>>> - "MakeSanitized" ensures proper structural and data is validations occur
>>>
>>> As noted above it might only pay to refactor a small amount of current
>> usage to the safer APIs.
>>>
>>> We could potentially go even further down the rabbit hole and try to
>> define standard for a Hungarian notation [1] to make it more obvious what
>> invariants are expected for a particular data-structure variable (I'm
>> actually -.5 on this).
>>>
>>> As a personal bias, I would rather have slower code that has lower risk
>> of crashing in production than faster code that does.  Obviously, there is
>> a tradeoff here, and the ideal is faster code that won't segfault.
>>>
>>> Thoughts?
>>>
>>> -Micah
>>>
>>> [1]
>> https://www.joelonsoftware.com/2005/05/11/making-wrong-code-look-wrong/
>>>
>>> On Sun, Mar 10, 2019 at 9:38 AM Wes McKinney <wesmck...@gmail.com>
>> wrote:
>>>>
>>>> hi folks,
>>>>
>>>> I think some issues are being conflated here, so let me try to dig
>>>> through them. Let's first look at the two cited bugs that were fixed,
>>>> if I have this right:
>>>>
>>>> * ARROW-4766: root cause dereferencing a null pointer
>>>> * ARROW-4774: root cause unsanitized Python user input
>>>>
>>>> None of the 4 remedies listed could have prevented ARROW-4766 AFAICT
>>>> since we currently allow for null buffers (the object, not the pointer
>>>> inside) in ArrayData. This has been discussed on the mailing list in
>>>> the past; "sanitizing" ArrayData to be free of null objects would be
>>>> expensive and my general attitude in the C++ library is that we should
>>>> not be in the business of paying extra CPU cycles for the 1-5% case
>>>> when it is unneeded in the 95-99% of cases. We have DCHECK assertions
>>>> to check these issues in debug builds while avoiding the costs in
>>>> release builds. In the case of checking edge cases in computational
>>>> kernels, suffice to say that we should check every kernel on length-0
>>>> input with null buffers to make sure this case is properly handled
>>>>
>>>> In the case of ARROW-4774, we should work at the language binding
>>>> interface to make sure we have convenient "validating" constructors
>>>> that check user input for common problems. This can prevent the
>>>> duplication of this code across the binding layers (GLib, Python, R,
>>>> MATLAB, etc.)
>>>>
>>>>  On the specific 4 steps mentioned by Francois, here are my thoughts:
>>>>
>>>> 1. Having StatusOr would be useful, but this is a programming
>> convenience
>>>>
>>>> 2. There are a couple purposes of the static factory methods that
>>>> exist now, like Table::Make and RecordBatch::Make. One of the reasons
>>>> that I added them initially was because of the implicit constructor
>>>> behavior of std::vector inside a call to std::make_shared. If you have
>>>> a std::vector<T> argument in a class's constructor, then
>>>> std::make_shared<Klass>(..., {...}) will not result in the initializer
>>>> list constructing the std::vector<T>. This means some awkwardness like
>>>> having to assign a std::vector<T> lvalue and _then_ pass that to
>>>> std::make_shared<Klass>(..., vector_arg, ...).
>>>>
>>>> I do not agree with refactoring these methods to use "validating"
>>>> constructors. Users of these C++ APIs should know what their
>>>> requirements are, and we provide in some cases a Validate() to spend
>>>> the extra cycles to assert preconditions. Therefore:
>>>>
>>>> 3. -1 on this
>>>> 4. -1 also
>>>>
>>>> Thanks
>>>> Wes
>>>>
>>>> On Sat, Mar 9, 2019 at 9:58 PM Micah Kornfield <emkornfi...@gmail.com>
>> wrote:
>>>>>
>>>>> HI François,
>>>>> This sounds great.  I would hope that as part of this we document the
>>>>> invariants (and possible sharp edges like zero length/all null and no
>> null
>>>>> Arrays).
>>>>>
>>>>> Is your intent to allow languages binding to the C++ library go
>> through the
>>>>> new API or will they still be able to use the "dangerous" ones?
>>>>>
>>>>> -Micah
>>>>>
>>>>> On Fri, Mar 8, 2019 at 6:16 PM Francois Saint-Jacques <
>>>>> fsaintjacq...@gmail.com> wrote:
>>>>>
>>>>>> Greetings,
>>>>>>
>>>>>> I noted that the current C++ API permits constructing core objects
>> breaking
>>>>>> said classes invariants. The following recent issues were affected
>> by this:
>>>>>>
>>>>>> - ARROW-4766: segfault due to invalid ArrayData with nullptr buffer
>>>>>> - ARROW-4774: segfault due to invalid Table with columns of
>> different
>>>>>> length
>>>>>>
>>>>>> Consumers often assumes that the objects respect the invariants,
>> e.g. by
>>>>>> dereferencing `array_data->buffers[i]->data()` or
>>>>>> `Array::GetValue(table.n_rows - 1)` .
>>>>>>
>>>>>> Sample of classes which requires invariant in the constructor:
>>>>>> - ArrayData/Array: number and size of buffers depending on type
>>>>>> - ChunkedArray: Arrays of same type
>>>>>> - Column: same as ChunkedArray and Field must match array's type
>>>>>> - RecordBatch: number of columns and schema must match, columns of
>> same
>>>>>> size
>>>>>> - Table: columns must be of same size
>>>>>>
>>>>>> Some classes provide static factory methods, notably:
>>>>>> - Most `shared_ptr<T> *::Make` methods, but they lack the Status
>> return
>>>>>> type to indicate
>>>>>>   failure, the consumer can at least check for nullptr
>>>>>> - `Status Table::FromRecordBatches(..., shared_ptr<T> *out)` is a
>> good
>>>>>> candidate to follow
>>>>>>
>>>>>> I suspect that mis-usage is only going to grow with the number of
>> users and
>>>>>> language that binds to C++. I would like to propose a plan to
>> tackle for
>>>>>> the
>>>>>> 0.14.0 release
>>>>>>
>>>>>> 1. Implement `StatusOr` (ARROW-4800), providing a cleaner API by
>> minimizing
>>>>>>    output parameters.
>>>>>> 2. Refactor normalized factory methods for each core object
>> (ArrayData,
>>>>>>    ChunkedArray, Column, RecordBatch, Table)
>>>>>>     - Common naming, I suggest we stay with `Make`.
>>>>>>     - Common return type, `StatusOr<shared_ptr<T>>`
>>>>>> 3. Refactor existing Make methods to use new methods but preserve
>> original
>>>>>>    signature by losing error message, on top of marking them
>> deprecated.
>>>>>> 4. Mark non-validating constructors as deprecated and ideailly make
>> every
>>>>>> "dangerous" constructor non-public.
>>>>>>
>>>>>> We'd give 1-2 release for consumers to stop using the deprecated
>>>>>> methods/constructors.
>>>>>>
>>>>>> François
>>>>>>
>>
> 

Reply via email to